I just fine-tuned a BERT model for token classification (NER), and it works great!
However, I ran into a nuance with the tokenizer: when I use my model for inference, the returned entities contain "##" fragments, presumably because of the subword (WordPiece) tokenizer.
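To illustrate what I mean, here is a minimal sketch of the merging I would expect the pipeline to do for me (assuming each entity dict has `word` and `entity` keys, as returned with `aggregation_strategy=None`):

```python
def merge_subword_entities(entities):
    """Merge WordPiece pieces (prefixed with '##') back into whole words.

    A rough sketch of what aggregation is supposed to handle; keeps the
    label of the first piece of each word.
    """
    merged = []
    for ent in entities:
        if ent["word"].startswith("##") and merged:
            # Continuation piece: glue it onto the previous word
            merged[-1]["word"] += ent["word"][2:]
        else:
            merged.append({"word": ent["word"], "entity": ent["entity"]})
    return merged

raw = [{"word": "Hu", "entity": "B-ORG"},
       {"word": "##gging", "entity": "I-ORG"},
       {"word": "Face", "entity": "I-ORG"}]
print(merge_subword_entities(raw))
# → [{'word': 'Hugging', 'entity': 'B-ORG'}, {'word': 'Face', 'entity': 'I-ORG'}]
```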
To fix this, I tried the `aggregation_strategy` parameter. It accepts five values: "none", "simple", "first", "average", and "max". With "none" or "simple" the code runs fine, but I still get the results as subwords with "##". With "first", "average", or "max", I get the following error:
```
word_entities.append(self.aggregate_word(word_group, aggregation_strategy))
Python\Python39\lib\site-packages\transformers\pipelines\token_classification.py", line 336, in aggregate_word
    word = self.tokenizer.convert_tokens_to_string([entity["word"] for entity in entities])
TypeError: 'NoneType' object is not iterable
```
Any idea how to fix this?
Thank you!