Handling tokenization effects of punctuated numbers in NER (e.g. $10,000)

So I dove in pretty deep and made many edits to the pipeline to work out the core issue and get something that worked. Afterwards I took a step back and wrote something simpler that just wraps the output of calls to the pipeline (see below).
Notes:

  1. Set `aggregation_strategy=None` in the pipeline
  2. `x` is the output of the Hugging Face token-classification pipeline
  3. `desired_tokens` is the list of tokens that you want to aggregate to, e.g. in my example above `["Price:", " $4,290,000", "Number", …]`

I have learnt a lot by diving in, but I am left confused about why `is_split_into_words` in the tokenizer doesn't behave the way I thought it would. Even though I explicitly pass `$4,290,000` as its own word, the tokenizer still splits it up into its components, but without the `##` prefix that would mark them as sub-tokens to be re-aggregated.
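The likely explanation: in a BERT-style tokenizer, punctuation splitting happens in the basic (pre-WordPiece) step, and `is_split_into_words` only tells the tokenizer the input is already split on whitespace; each "word" still goes through punctuation splitting. A rough sketch of that pre-splitting step (`punct_presplit` is a hypothetical helper, a regex approximation of the real BasicTokenizer):

```python
import re

def punct_presplit(word):
    # Hypothetical helper: a regex approximation of BERT's basic tokenization,
    # which splits every punctuation character into its own token *before*
    # WordPiece runs. is_split_into_words skips the whitespace split only,
    # not this step.
    return re.findall(r"\w+|[^\w\s]", word)

print(punct_presplit("$4,290,000"))
# ['$', '4', ',', '290', ',', '000']
```

Each piece is then a separate input to WordPiece, so none of them gets the `##` continuation prefix, which is why they come back looking like independent words.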

    def aggregate(x, desired_tokens):
        # Strip whitespace from the target words so e.g. ' $4,290,000' matches.
        desired = [t.strip() for t in desired_tokens]
        joined_word = ''
        joined_group = []
        new_x = []

        for i in x:
            # Drop the '##' continuation prefix (not every '#') before re-joining.
            joined_word += i['word'].removeprefix('##')
            joined_group.append(i)
            if desired and joined_word == desired[0]:
                # Take the entity and score of the first sub-token; the span
                # runs from the first sub-token's start to the last one's end.
                new_i = {'entity': joined_group[0]['entity'],
                         'word': joined_word,
                         'start': joined_group[0]['start'],
                         'end': joined_group[-1]['end'],
                         'score': joined_group[0]['score']}
                new_x.append(new_i)

                joined_word = ''
                joined_group = []
                desired = desired[1:]
        return new_x
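For anyone who wants to sanity-check the helper, here is a self-contained run on mock pipeline output; the entity labels, offsets, and scores are made up for illustration, and the function is repeated so the snippet runs standalone:

```python
def aggregate(x, desired_tokens):
    # Same logic as above, repeated so this snippet runs on its own.
    desired = [t.strip() for t in desired_tokens]
    joined_word, joined_group, new_x = '', [], []
    for i in x:
        joined_word += i['word'].removeprefix('##')  # drop sub-token prefix
        joined_group.append(i)
        if desired and joined_word == desired[0]:
            new_x.append({'entity': joined_group[0]['entity'],
                          'word': joined_word,
                          'start': joined_group[0]['start'],
                          'end': joined_group[-1]['end'],
                          'score': joined_group[0]['score']})
            joined_word, joined_group = '', []
            desired = desired[1:]
    return new_x

# Mock per-token output, shaped like pipeline(..., aggregation_strategy=None)
# would return for "$4,290,000" (values invented for the demo):
x = [
    {'entity': 'B-PRICE', 'word': '$',   'start': 0, 'end': 1,  'score': 0.99},
    {'entity': 'I-PRICE', 'word': '4',   'start': 1, 'end': 2,  'score': 0.98},
    {'entity': 'I-PRICE', 'word': ',',   'start': 2, 'end': 3,  'score': 0.97},
    {'entity': 'I-PRICE', 'word': '290', 'start': 3, 'end': 6,  'score': 0.98},
    {'entity': 'I-PRICE', 'word': ',',   'start': 6, 'end': 7,  'score': 0.97},
    {'entity': 'I-PRICE', 'word': '000', 'start': 7, 'end': 10, 'score': 0.98},
]
result = aggregate(x, [' $4,290,000'])
print(result[0]['word'], result[0]['start'], result[0]['end'])
# $4,290,000 0 10
```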