### System Info
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.35.2
- Platform: macOS-13.6.2-arm64-arm-64bit
- Python version: 3.11.6
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
### Who can help?
Blame gives roughly: @luccailliau @Narsil
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Reproduction
```
from pprint import pprint

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")

# One pipeline per aggregation strategy, so the outputs can be compared side by side.
nlp_no_agg = pipeline('ner', model=model, tokenizer=tokenizer)
nlp_simple = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp_first = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="first")
nlp_avg = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="average")
nlp_max = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="max")

for example in [
    "Bonjour,je suis le docteur Brice Saintclair",
    "Je vous renvoie en Dermatologie.",
]:
    print(example)
    print("no agg")
    pprint(nlp_no_agg(example))
    print("simple")
    pprint(nlp_simple(example))
    print("first")
    pprint(nlp_first(example))
    print("avg")
    pprint(nlp_avg(example))
    print("max")
    pprint(nlp_max(example))
```
Result:
```
Bonjour,je suis le docteur Brice Saintclair
no agg
[{'end': 30,
'entity': 'I-PER',
'index': 7,
'score': 0.9949898,
'start': 26,
'word': '▁Bri'},
{'end': 32,
'entity': 'I-PER',
'index': 8,
'score': 0.99483263,
'start': 30,
'word': 'ce'},
{'end': 38,
'entity': 'I-PER',
'index': 9,
'score': 0.9943815,
'start': 32,
'word': '▁Saint'},
{'end': 43,
'entity': 'I-PER',
'index': 10,
'score': 0.9938929,
'start': 38,
'word': 'clair'}]
simple
[{'end': 43,
'entity_group': 'PER',
'score': 0.9945242,
'start': 26,
'word': 'Brice Saintclair'}]
first
[{'end': 43,
'entity_group': 'PER',
'score': 0.99468565,
'start': 26,
'word': 'BriceSaintclair'}]
avg
[{'end': 43,
'entity_group': 'PER',
'score': 0.9945242,
'start': 26,
'word': 'BriceSaintclair'}]
max
[{'end': 43,
'entity_group': 'PER',
'score': 0.99468565,
'start': 26,
'word': 'BriceSaintclair'}]
Je vous renvoie en Dermatologie.
no agg
[{'end': 22,
'entity': 'I-ORG',
'index': 5,
'score': 0.46623757,
'start': 18,
'word': '▁Der'},
{'end': 25,
'entity': 'I-ORG',
'index': 6,
'score': 0.4892864,
'start': 22,
'word': 'mat'},
{'end': 31,
'entity': 'I-ORG',
'index': 7,
'score': 0.49201807,
'start': 25,
'word': 'ologie'}]
simple
[{'end': 31,
'entity_group': 'ORG',
'score': 0.48251402,
'start': 18,
'word': 'Dermatologie'}]
first
[{'end': 32,
'entity_group': 'ORG',
'score': 0.46623757,
'start': 18,
'word': 'Dermatologie.'}]
avg
[{'end': 32,
'entity_group': 'ORG',
'score': 0.3619019,
'start': 18,
'word': 'Dermatologie.'}]
max
[]
```
### Expected behavior
Given the non-aggregated results, it seems that there are 2 bugs:
- 1/ The space between `Brice` and `Saintclair` is omitted when the tokens are fused by any aggregation strategy other than "simple". I would expect the space to remain, given that it is part of the tagged token.
- 2/ The period after "Dermatologie" is fused with it under `first` and `average`. With `max`, this makes the whole word be classified as "O", so no entity is returned. I would expect the period to be treated as outside the word, given that it is its own token.
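
As a possible stopgap for bug 1 (not a fix in the pipeline itself), the entity text can be recovered from the reported `start`/`end` offsets instead of the `word` field. The sketch below is a minimal illustration using the `first` results copied from the output above; `surface_text` is a hypothetical helper, not part of `transformers`.

```python
# Entity dicts copied from the "first" results reported above (score omitted).
examples = [
    (
        "Bonjour,je suis le docteur Brice Saintclair",
        {"entity_group": "PER", "start": 26, "end": 43, "word": "BriceSaintclair"},
    ),
    (
        "Je vous renvoie en Dermatologie.",
        {"entity_group": "ORG", "start": 18, "end": 32, "word": "Dermatologie."},
    ),
]


def surface_text(sentence, entity):
    """Hypothetical helper: slice the original sentence by character offsets.

    The offsets include the leading space carried by the first token,
    so the slice is stripped before being returned.
    """
    return sentence[entity["start"]:entity["end"]].strip()


recovered = [surface_text(sentence, entity) for sentence, entity in examples]

for (sentence, entity), text in zip(examples, recovered):
    print(f"{entity['word']!r} -> {text!r}")
```

This restores the space in `Brice Saintclair`, but it does nothing for bug 2: the `end` offset of 32 still covers the period, so the second entity comes back as `Dermatologie.`.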