Token classification tokenizer problems at inference

I just fine-tuned a BERT model for token classification (NER), and it works great!

However, I found some nuances with the tokenizer: when I use my model for inference, the resulting entities come back with "##" prefixes, due to the subword tokenizer I suppose.
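For context, this is what those "##" prefixes mean: WordPiece marks continuation pieces of a split word with "##", and aggregation is what merges them back together. A minimal, model-free sketch of that merging (illustration only, not the actual transformers implementation):

```python
def merge_subwords(tokens):
    """Merge WordPiece '##' continuation tokens back into whole words."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: glue onto previous word
        else:
            words.append(tok)  # start of a new word
    return words

print(merge_subwords(["Wash", "##ing", "##ton", "visited", "Seattle"]))
# → ['Washington', 'visited', 'Seattle']
```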

In order to fix this, I used the "aggregation_strategy" parameter. This parameter accepts five values: "none", "simple", "first", "average" and "max". With "none" or "simple" the code runs fine, but I still get the results as subwords with "##". If I try "first", "average" or "max", I get the following error:

 word_entities.append(self.aggregate_word(word_group, aggregation_strategy))
  Python\Python39\lib\site-packages\transformers\pipelines\token_classification.py", line 336, in aggregate_word
    word = self.tokenizer.convert_tokens_to_string([entity["word"] for entity in entities])
TypeError: 'NoneType' object is not iterable

Any idea how to fix this?

thank you

Hello, I ran into the same error! I hope you found a way around it!

No, I did not.

Luckily for me, I found that one of the input sentences was None. Therefore, I preprocessed the input by replacing the empty lines with dummy text, and it worked!
Thanks for raising the issue and for your quick response!
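A minimal sketch of that preprocessing step (the placeholder text and helper name are my own; you could equally filter the bad inputs out, as mentioned below):

```python
def clean_inputs(sentences, placeholder="[EMPTY]"):
    """Replace None or blank inputs with dummy text so the NER pipeline
    never receives an empty string (which triggers the TypeError)."""
    return [s if s and s.strip() else placeholder for s in sentences]

sentences = ["Angela Merkel visited Paris.", None, "", "   "]
print(clean_inputs(sentences))
# → ['Angela Merkel visited Paris.', '[EMPTY]', '[EMPTY]', '[EMPTY]']
```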

Oh yeah!
I remember now, that was the problem. I had some None inputs as well, you are right.
Removing them was the solution.
