Handling tokenization effects of punctuated numbers in NER (e.g. $10,000)

I have previously successfully fine-tuned an NER/token classification model on a custom dataset (roughly based on this tutorial) - however, in attempting to shift to a new dataset I have come across an issue.

The new feature of the dataset is that it contains numbers/dollar values, e.g. of the form $123,000, 1,000, $1,000,000, etc. The model is failing to classify these in full - and it's made me realize I am probably fundamentally misunderstanding something about the sub-tokenization process - so I was hoping someone could shed some light for me.

Here are a few example rows from my dataset:

You will note that I have already split my sentences/documents into words - this was done simply on whitespace, FYI - so when generating my training data I use is_split_into_words=True in the tokenizer.
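For context, the label-alignment step this implies looks roughly like the sketch below. The `word_ids` list is hard-coded here purely for illustration; with a real tokenizer you would get it from `tokenizer(words, is_split_into_words=True).word_ids()`:

```python
# Sketch of word-to-sub-token label alignment with is_split_into_words=True.
# word_ids is hard-coded for illustration; None entries stand in for special
# tokens like [CLS]/[SEP].

words = ["Price:", "$4,299,000"]    # whitespace-split input
labels = ["O", "B-asking_price"]    # one label per word

word_ids = [None, 0, 0, 1, 1, 1, 1, 1, 1, None]  # one entry per sub-token

def align_labels(labels, word_ids, ignore="IGN"):
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore)          # masked out (-100) during training
        elif wid != prev:
            aligned.append(labels[wid])     # first sub-token keeps the word label
        else:
            aligned.append(labels[wid].replace("B-", "I-", 1))  # continuations become I-
        prev = wid
    return aligned

print(align_labels(labels, word_ids))
```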

The model trains well - 0.87 F1 score on the B-asking_price tag I am highlighting here.

However after I export the code as a pipeline, and try to run the full sentence (i.e. not broken up into words) through it in a different environment, I get the following result:

I am showing one example here, but this is pretty much consistent across all numbers I test it on. A couple of my thoughts:

  1. It splits the number into sub-tokens based on punctuation (`,` `$` `.` etc.) - but doesn’t aggregate them again post-prediction. In the training data generation process I passed in $4,299,000 as a single token (understanding that the tokenizer would further split it up into sub-tokens). Is there a way to similarly pass input word by word into a prediction pipeline so that it knows where to aggregate?
  2. It’s pretty consistent in that it will predict B-<LABEL> for the first token and B-<LABEL> or I-<LABEL> for the next tokens up until the first comma, after which it will predict the O category with high confidence.

My uninformed guess is that the “$4,” and the “299,000” are getting split up in the pre-processing step of training because of the punctuation, with the latter being assigned no label? I feel like the solution would be to enforce aggregation at the known word, e.g. “$4,299,000”, so that using aggregation_strategy="first" it would get assigned the correct label - but I’m not sure if there is an out-of-the-box way to do this?
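To make that guess concrete, here is a toy model (not the actual Hugging Face implementation) of how a "first"-style aggregation merges sub-tokens: only tokens carrying the `##` continuation marker get folded into the previous word, so the standalone pieces of a punctuated number never get re-joined:

```python
def aggregate_first(tokens):
    """Toy 'first' aggregation: a '##'-prefixed token continues the previous
    word; each merged group keeps the label of its first token."""
    groups = []
    for tok, label in tokens:
        if tok.startswith("##") and groups:
            prev_tok, prev_label = groups[-1]
            groups[-1] = (prev_tok + tok[2:], prev_label)
        else:
            groups.append((tok, label))
    return groups

# A normal WordPiece split merges back into one word...
print(aggregate_first([("play", "B-X"), ("##ing", "I-X")]))
# ...but the pieces of "$4,299,000" carry no "##", so they stay separate:
print(aggregate_first([("$", "B-asking_price"), ("4", "I-asking_price"),
                       (",", "O"), ("299", "O"), (",", "O"), ("000", "O")]))
```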

Posting this here, because I feel like I’m missing a trick with how this sub-tokenization works & I’m sure someone has had to deal with punctuation in numbers before. Any ideas welcome!

So I dove in pretty deep & made many edits to the pipeline to work out the core issue & get something that worked. Afterwards I took a step back and wrote something simpler that just wraps the output of calls to the pipeline (see below).
Notes:

  1. Set aggregation_strategy=None in the pipeline
  2. x is the output of the Hugging Face token classification pipeline
  3. `desired_tokens` is the list of tokens that you want to aggregate to, e.g. in my example above `["Price:", " $4,290,000", "Number", …]`

I have learnt a lot by diving in - but I have been left confused as to why is_split_into_words in the tokenizer doesn’t behave like I thought it would. Even though I explicitly pass $4,290,000 as its own word, the tokenizer still splits it up into its components - but without the ## prefixes that would mark them as sub-tokens to be re-aggregated.
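As far as I can tell, the reason is that BERT-style tokenizers run a punctuation-splitting pre-tokenization step before WordPiece, and is_split_into_words only skips the whitespace split, not the punctuation split. A rough pure-Python sketch of that behaviour (not the real implementation):

```python
import re

def punct_split(word):
    """Rough sketch of BERT-style punctuation pre-tokenization: every
    non-alphanumeric character becomes its own token, even inside a
    'word' that was passed in pre-split."""
    return [t for t in re.split(r"([^\w])", word) if t]

print(punct_split("$4,290,000"))
# None of these pieces gets a '##' prefix, because each one is the *start*
# of a pre-token as far as WordPiece is concerned.
```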

    def aggregate(x, desired_tokens):
        """Merge raw (aggregation_strategy=None) pipeline output x back into
        the known whitespace-split words in desired_tokens, giving each merged
        word the entity/score of its first sub-token."""
        joined_word = ''
        joined_group = []
        new_x = []

        for i in x:
            joined_word += i['word'].replace('##', '')  # strip sub-token markers
            joined_group.append(i)
            if desired_tokens and joined_word == desired_tokens[0]:
                new_i = {'entity': joined_group[0]['entity'],
                         'word': joined_word,
                         'start': joined_group[0]['start'],
                         'end': joined_group[-1]['end'],
                         'score': joined_group[0]['score']}
                new_x.append(new_i)

                joined_word = ''
                joined_group = []
                desired_tokens = desired_tokens[1:]  # move on to the next word
        return new_x