Handling tokenization effects of punctuated numbers in NER (e.g. $10,000)

So I dove in pretty deep and made many edits to the pipeline to work out the core issue and get something that worked. Afterwards I took a step back and wrote something simpler that just wraps the output of calls to the pipeline (see below).
Notes:

  1. Set `aggregation_strategy=None` in the pipeline
  2. `x` is the output of the Hugging Face token-classification pipeline
  3. `desired_tokens` is the list of tokens that you want to aggregate to, e.g. in my example above `["Price:", " $4,290,000", "Number", …]`

I have learnt a lot by diving in, but I am left confused about why `is_split_into_words` in the tokenizer doesn't behave the way I thought it would. Even though I explicitly pass `$4,290,000` as its own word, the tokenizer still splits it up into its components, but without the `##` prefix that would mark them as sub-tokens to be re-aggregated.
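The likely explanation: in a BERT-style tokenizer, punctuation splitting happens in the basic (pre-WordPiece) step, and `is_split_into_words` only tells the tokenizer the input is already split on whitespace; each "word" still goes through punctuation splitting. A rough sketch of that pre-splitting step (`punct_presplit` is a hypothetical helper, a regex approximation of the real BasicTokenizer):

```python
import re

def punct_presplit(word):
    # Hypothetical helper: a regex approximation of BERT's basic tokenization,
    # which splits every punctuation character into its own token *before*
    # WordPiece runs. is_split_into_words skips the whitespace split only,
    # not this step.
    return re.findall(r"\w+|[^\w\s]", word)

print(punct_presplit("$4,290,000"))
# ['$', '4', ',', '290', ',', '000']
```

Each piece is then a separate input to WordPiece, so none of them gets the `##` continuation prefix, which is why they come back looking like independent words.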

    def aggregate(x, desired_tokens):
        # Strip whitespace from the target words so e.g. ' $4,290,000' matches.
        desired = [t.strip() for t in desired_tokens]
        joined_word = ''
        joined_group = []
        new_x = []

        for i in x:
            # Drop the '##' continuation prefix (not every '#') before re-joining.
            joined_word += i['word'].removeprefix('##')
            joined_group.append(i)
            if desired and joined_word == desired[0]:
                # Take the entity and score of the first sub-token; the span
                # runs from the first sub-token's start to the last one's end.
                new_i = {'entity': joined_group[0]['entity'],
                         'word': joined_word,
                         'start': joined_group[0]['start'],
                         'end': joined_group[-1]['end'],
                         'score': joined_group[0]['score']}
                new_x.append(new_i)

                joined_word = ''
                joined_group = []
                desired = desired[1:]
        return new_x
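For anyone who wants to sanity-check the helper, here is a self-contained run on mock pipeline output; the entity labels, offsets, and scores are made up for illustration, and the function is repeated so the snippet runs standalone:

```python
def aggregate(x, desired_tokens):
    # Same logic as above, repeated so this snippet runs on its own.
    desired = [t.strip() for t in desired_tokens]
    joined_word, joined_group, new_x = '', [], []
    for i in x:
        joined_word += i['word'].removeprefix('##')  # drop sub-token prefix
        joined_group.append(i)
        if desired and joined_word == desired[0]:
            new_x.append({'entity': joined_group[0]['entity'],
                          'word': joined_word,
                          'start': joined_group[0]['start'],
                          'end': joined_group[-1]['end'],
                          'score': joined_group[0]['score']})
            joined_word, joined_group = '', []
            desired = desired[1:]
    return new_x

# Mock per-token output, shaped like pipeline(..., aggregation_strategy=None)
# would return for "$4,290,000" (values invented for the demo):
x = [
    {'entity': 'B-PRICE', 'word': '$',   'start': 0, 'end': 1,  'score': 0.99},
    {'entity': 'I-PRICE', 'word': '4',   'start': 1, 'end': 2,  'score': 0.98},
    {'entity': 'I-PRICE', 'word': ',',   'start': 2, 'end': 3,  'score': 0.97},
    {'entity': 'I-PRICE', 'word': '290', 'start': 3, 'end': 6,  'score': 0.98},
    {'entity': 'I-PRICE', 'word': ',',   'start': 6, 'end': 7,  'score': 0.97},
    {'entity': 'I-PRICE', 'word': '000', 'start': 7, 'end': 10, 'score': 0.98},
]
result = aggregate(x, [' $4,290,000'])
print(result[0]['word'], result[0]['start'], result[0]['end'])
# $4,290,000 0 10
```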