Empty entity string when using TokenClassificationPipeline

Hello, I’ve run into an issue with the TokenClassificationPipeline: I’m getting empty strings as the entity classification for some tokens whenever I use any aggregation strategy other than 'none'. With 'none', the pipeline picks the max value from the raw score vectors as described (and is actually correct); the downside is that every word is then broken into its constituent subwords. I was thinking of writing my own aggregator for the output, to at least stitch the subwords back together, but the built-in aggregation also does a good job of grouping multi-word expressions, so ideally I’d like to figure out why it’s producing empty classifications.
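For what it’s worth, the subword-merging part is easy enough to do by hand over the 'none' output. A minimal sketch (keeping the first sub-token’s label and score, roughly what the 'first' strategy does at the word level):

```python
# Post-hoc aggregator over the aggregation_strategy="none" output:
# merge WordPiece continuations ("##"-prefixed tokens) into the
# preceding entry, keeping the first sub-token's label and score.

def merge_subwords(entities):
    """Merge '##'-prefixed sub-tokens into the preceding entity dict."""
    merged = []
    for ent in entities:
        if ent["word"].startswith("##") and merged:
            prev = merged[-1]
            prev["word"] += ent["word"][2:]  # drop the '##' marker
            prev["end"] = ent["end"]         # extend the character span
        else:
            merged.append(dict(ent))         # copy so the input isn't mutated
    return merged

# the two sub-tokens of "sweetener" from the 'none' output below
raw = [
    {"entity": "F", "score": 0.99968886, "index": 6, "word": "sweet", "start": 29, "end": 34},
    {"entity": "F", "score": 0.99971086, "index": 7, "word": "##ener", "start": 34, "end": 38},
]
print(merge_subwords(raw))
# → [{'entity': 'F', 'score': 0.99968886, 'index': 6, 'word': 'sweetener', 'start': 29, 'end': 38}]
```

That handles subwords, but not the multi-word grouping the pipeline already does well, which is why I’d rather understand the aggregation behaviour.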

When the model does give a classification for a token it’s generally very accurate, so I’m confused as to why aggregation sometimes falls short, even for words that aren’t split into subwords ('artificial' below, for example, is a single token in the raw output yet gets an empty entity_group).

Below is an example of the outputs for the sentence "we need some more artificial sweetener for our coffee":

With 'first' aggregation:

[{'entity_group': 'Z',
  'score': 0.99955326,
  'word': 'we',
  'start': 0,
  'end': 2},
 {'entity_group': 'A',
  'score': 0.999987,
  'word': 'need',
  'start': 3,
  'end': 7},
 {'entity_group': 'Z',
  'score': 0.99968517,
  'word': 'some',
  'start': 8,
  'end': 12},
 {'entity_group': 'N',
  'score': 0.999977,
  'word': 'more',
  'start': 13,
  'end': 17},
 {'entity_group': '',
  'score': 0.99999094,
  'word': 'artificial',
  'start': 18,
  'end': 28},
 {'entity_group': 'F',
  'score': 0.99971086,
  'word': 'sweetener',
  'start': 29,
  'end': 38},
 {'entity_group': 'Z',
  'score': 0.9994038,
  'word': 'for our',
  'start': 39,
  'end': 46},
 {'entity_group': 'F',
  'score': 0.9999933,
  'word': 'coffee',
  'start': 47,
  'end': 53}]

With 'none' aggregation:

[{'entity': 'Z',
  'score': 0.99955326,
  'index': 1,
  'word': 'we',
  'start': 0,
  'end': 2},
 {'entity': 'A',
  'score': 0.999987,
  'index': 2,
  'word': 'need',
  'start': 3,
  'end': 7},
 {'entity': 'Z',
  'score': 0.99968517,
  'index': 3,
  'word': 'some',
  'start': 8,
  'end': 12},
 {'entity': 'N',
  'score': 0.999977,
  'index': 4,
  'word': 'more',
  'start': 13,
  'end': 17},
 {'entity': 'A',
  'score': 0.99999094,
  'index': 5,
  'word': 'artificial',
  'start': 18,
  'end': 28},
 {'entity': 'F',
  'score': 0.99968886,
  'index': 6,
  'word': 'sweet',
  'start': 29,
  'end': 34},
 {'entity': 'F',
  'score': 0.99971086,
  'index': 7,
  'word': '##ener',
  'start': 34,
  'end': 38},
 {'entity': 'Z',
  'score': 0.9998388,
  'index': 8,
  'word': 'for',
  'start': 39,
  'end': 42},
 {'entity': 'Z',
  'score': 0.9989687,
  'index': 9,
  'word': 'our',
  'start': 43,
  'end': 46},
 {'entity': 'F',
  'score': 0.9999933,
  'index': 10,
  'word': 'coffee',
  'start': 47,
  'end': 53}]
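To pin down which prediction is being lost, I cross-checked the two outputs by character span (excerpted below): the empty entity_group on 'artificial' lines up exactly with the raw 'A' prediction, same score and offsets, so the label is clearly there before aggregation and disappears during it.

```python
# Excerpts of the two outputs above, cross-referenced by character span to
# find which raw label sits under the aggregated entry with the empty group.
grouped = [{"entity_group": "", "score": 0.99999094, "word": "artificial", "start": 18, "end": 28}]
raw = [
    {"entity": "N", "score": 0.999977,   "index": 4, "word": "more",       "start": 13, "end": 17},
    {"entity": "A", "score": 0.99999094, "index": 5, "word": "artificial", "start": 18, "end": 28},
    {"entity": "F", "score": 0.99968886, "index": 6, "word": "sweet",      "start": 29, "end": 34},
]
for g in grouped:
    if g["entity_group"] == "":
        covered = [r["entity"] for r in raw if r["start"] >= g["start"] and r["end"] <= g["end"]]
        print(g["word"], "->", covered)
# prints: artificial -> ['A']
```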

Hopefully this makes it clearer what is happening.

Any ideas and help would be greatly appreciated.

Thanks!

As a supplement to this: the model I finetuned was a DistilBERT model.