I just fine-tuned a BERT model for token classification (NER), and it works great!
However, I ran into a nuance with the tokenizer: when I use my model for inference, the returned entities contain "##" fragments, presumably because of the subword (WordPiece) tokenizer.
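To illustrate what I mean, here is a minimal sketch of the merging I would expect the pipeline to do for me (assuming each entity dict has `word` and `entity` keys, as returned with `aggregation_strategy=None`):

```python
def merge_subword_entities(entities):
    """Merge WordPiece pieces (prefixed with '##') back into whole words.

    A rough sketch of what aggregation is supposed to handle; keeps the
    label of the first piece of each word.
    """
    merged = []
    for ent in entities:
        if ent["word"].startswith("##") and merged:
            # Continuation piece: glue it onto the previous word
            merged[-1]["word"] += ent["word"][2:]
        else:
            merged.append({"word": ent["word"], "entity": ent["entity"]})
    return merged

raw = [{"word": "Hu", "entity": "B-ORG"},
       {"word": "##gging", "entity": "I-ORG"},
       {"word": "Face", "entity": "I-ORG"}]
print(merge_subword_entities(raw))
# → [{'word': 'Hugging', 'entity': 'B-ORG'}, {'word': 'Face', 'entity': 'I-ORG'}]
```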
To fix this, I tried the `aggregation_strategy` parameter. It accepts five values: "none", "simple", "first", "average", and "max". With "none" or "simple" the code runs fine, but I still get the results as subwords with "##". With "first", "average", or "max", I get the following error:
```
word_entities.append(self.aggregate_word(word_group, aggregation_strategy))
Python\Python39\lib\site-packages\transformers\pipelines\token_classification.py", line 336, in aggregate_word
    word = self.tokenizer.convert_tokens_to_string([entity["word"] for entity in entities])
TypeError: 'NoneType' object is not iterable
```
Any idea how to fix this?
Thank you!