TokenClassificationPipeline produces entities with "##" characters

Hi,

When I use TokenClassificationPipeline with some models I get entities with ## characters. For instance with the “elastic/distilbert-base-cased-finetuned-conll03-english” model.

It looks like a tokenization issue that may be related to accented characters.
Any ideas on what causes this and how to fix it?

Here is my code and sample output.

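from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

NER_MODEL_NAME = ...  # set to one of the two model names shown below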

tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL_NAME)

# Create a pipeline for NER
ner_pipeline = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Run NER
text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")

And here is the output with two models.

With NER_MODEL_NAME = ‘CATIE-AQ/NERmembert-base-4entities’
Entities are correct :

    Côme -> PER (0.67)
    Aix-en-Provence -> LOC (1.00)
    INRIA -> ORG (1.00)

With NER_MODEL_NAME = ‘elastic/distilbert-base-cased-finetuned-conll03-english’
Entities are not correct :

    C -> PER (0.50)
    ##ôme -> ORG (0.40)
    Ai -> LOC (0.62)
    ##x -> LOC (0.97)
    - -> LOC (0.64)
    en -> LOC (0.91)
    - -> LOC (0.81)
    Provence -> LOC (0.87)
    et travaille pour l ’ INRIA -> ORG (0.89)

Regards.

Dominique


It seems you’ve unearthed an ancient bug. A living fossil of a bug.

from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

NER_MODEL_NAME = "elastic/distilbert-base-cased-finetuned-conll03-english"
"""
C -> PER (0.50)
##ôme -> ORG (0.40)
Ai -> LOC (0.62)
##x -> LOC (0.97)
- -> LOC (0.64)
en -> LOC (0.91)
- -> LOC (0.81)
Provence -> LOC (0.87)
et travaille pour l ’ INRIA -> ORG (0.89)
"""
NER_MODEL_NAME = "CATIE-AQ/NERmembert-base-4entities"
"""
Côme -> PER (0.67)
Aix-en-Provence -> LOC (1.00)
INRIA -> ORG (1.00)
"""
NER_MODEL_NAME = "elastic/distilbert-base-uncased-finetuned-conll03-english"
# aix - en - provence et travaille pour l ’ inria -> ORG (0.87)
NER_MODEL_NAME = "distilbert/distilbert-base-multilingual-cased"
"""
Côme -> LABEL_0 (0.57)
habite -> LABEL_1 (0.53)
à -> LABEL_0 (0.53)
Aix - en - -> LABEL_1 (0.52)
Provence et travaille pour l -> LABEL_0 (0.55)
’ -> LABEL_1 (0.51)
INRIA. -> LABEL_0 (0.55)
"""
NER_MODEL_NAME = "distilbert/distilbert-base-uncased"
"""
come -> LABEL_0 (0.54)
habit -> LABEL_1 (0.53)
##e -> LABEL_0 (0.52)
a aix -> LABEL_1 (0.52)
- -> LABEL_0 (0.52)
en - provence et -> LABEL_1 (0.55)
tr -> LABEL_0 (0.51)
##ava -> LABEL_1 (0.50)
##ille -> LABEL_0 (0.51)
pour -> LABEL_1 (0.54)
l -> LABEL_0 (0.52)
’ -> LABEL_1 (0.52)
inria -> LABEL_0 (0.52)
. -> LABEL_1 (0.50)
"""

tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL_NAME).to("cuda")

# Create a pipeline for NER
ner_pipeline = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    #aggregation_strategy="first" # this would work
    aggregation_strategy="simple"
)

# Run NER
text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
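
The commented-out aggregation_strategy="first" line above points at the likely cause. As far as I understand the documented behavior, "simple" only groups neighboring tokens by their predicted tag, while "first" rebuilds whole words and gives each word the tag of its first sub-token. The sketch below is a simplified illustration of that difference and is not the transformers implementation; the two token/label pairs are taken from the "simple" output above, and the function names are hypothetical.

```python
# Simplified illustration only; NOT the transformers implementation.
# The two (piece, label) pairs come from the "simple" output in this thread.
tokens = [
    ("C", "PER"),
    ("##ôme", "ORG"),
]


def group_simple(tokens):
    """Merge adjacent tokens only when they carry the same label."""
    groups = []
    for piece, label in tokens:
        if groups and groups[-1][1] == label:
            prev_text, _ = groups[-1]
            if piece.startswith("##"):
                # A "##" piece continues the previous word.
                joined = prev_text + piece[2:]
            else:
                joined = prev_text + " " + piece
            groups[-1] = (joined, label)
        else:
            # Different label: the piece starts a new group, "##" included.
            groups.append((piece, label))
    return groups


def group_first(tokens):
    """Rebuild whole words; each word keeps the label of its first piece."""
    words = []
    for piece, label in tokens:
        if piece.startswith("##") and words:
            text, first_label = words[-1]
            words[-1] = (text + piece[2:], first_label)
        else:
            words.append((piece, label))
    return words


simple_result = group_simple(tokens)
first_result = group_first(tokens)
print(simple_result)
print(first_result)
```

In this sketch the "##" only survives when the pieces of one word receive different labels, which is more likely when an English model sees French text.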

Hi,
The issue is with aggregation_strategy=“simple”. The other strategies do not return ##, but the entities they produce are not satisfactory.
The ‘elastic/distilbert-base-cased-finetuned-conll03-english’ model works fine when embedded directly in an Elasticsearch ingest pipeline.
Dominique


Why not use French NER?


It may be difficult to fix this issue on the library side.
He said to change the model.

In fact, I don’t really need to use this model for English text. I just tested it and don’t understand why it works fine directly in ES but not with the Hugging Face API, so I would like to know if I did something wrong.

Anyway, as my text is in French, I use either
NER_MODEL_NAME = “Jean-Baptiste/camembert-ner”
or
NER_MODEL_NAME = ‘CATIE-AQ/NERmembert-base-4entities’

“Jean-Baptiste/camembert-ner” is the best for my use case.

Dominique


I’m not sure whether this behavior of “simple” is intended, but the “##” prefixes themselves seem to come from the tokenizer, which adds them to sub-word pieces.

The output of this method is a list of strings, or tokens:

[‘Using’, ‘a’, ‘transform’, ‘##er’, ‘network’, ‘is’, ‘simple’]

This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with transformer, which is split into two tokens: transform and ##er.
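
To make the splitting described above concrete, here is a minimal sketch of greedy longest-match sub-word splitting and of gluing the "##" pieces back together. It is an illustration under stated assumptions, not the actual tokenizer code: TOY_VOCAB is a made-up vocabulary, and wordpiece_split and merge_pieces are hypothetical helper names.

```python
# Toy vocabulary: an assumption for illustration, not a real model's vocabulary.
TOY_VOCAB = {
    "Using", "a", "transform", "##er", "network", "is", "simple",
    "C", "##ôme",
}


def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def merge_pieces(pieces):
    """Undo the split: glue each "##" piece onto the piece before it."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


sentence = "Using a transformer network is simple"
pieces = [p for w in sentence.split() for p in wordpiece_split(w, TOY_VOCAB)]
print(pieces)
print(merge_pieces(pieces))
print(wordpiece_split("Côme", TOY_VOCAB))
```

With this toy vocabulary, "Côme" is not a known word, so it falls apart into "C" and "##ôme", the same two pieces that show up in the output at the top of the thread.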

https://stackoverflow.com/questions/67026731/is-there-a-way-to-use-huggingface-pretrained-tokenizer-with-wordpiece-prefix
