I have previously successfully fine-tuned an NER/token classification model on a custom dataset (roughly based on this tutorial) - however, in attempting to shift to a new dataset I have come across an issue.
The new feature of this dataset is that it contains numbers/dollar values, e.g. of the form `$123,000`, `1,000`, `$1,000,000`, etc. The model is failing to classify these in full - and it's made me realize I am probably fundamentally misunderstanding something about the sub-tokenization process - and I was hoping someone could shed some light for me.
Here are a few example rows from my dataset:
You will note that I have already split my sentences/documents into words - this was done simply on whitespace, FYI - so when generating my training data I am using `is_split_into_words` in the tokenizer.
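To make that concrete, here is a minimal sketch of how a pre-split word like `$4,299,000` gets broken into sub-tokens, and how `word_ids()` maps every piece back to the original word index (I'm using `bert-base-cased` as a stand-in tokenizer here, not my actual checkpoint):

```python
from transformers import AutoTokenizer

# Stand-in tokenizer; my real checkpoint's tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Asking", "price", "is", "$4,299,000"]
enc = tokenizer(words, is_split_into_words=True)

# The dollar amount is split on punctuation into several sub-tokens,
# but every resulting piece still maps back to word index 3.
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc.word_ids())
```

This word-index mapping is what I use at training time to align my word-level labels with the sub-tokens.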
The model trains well - 0.87 F1 score on the `B-asking_price` tag I am highlighting here.
However, after I export the model as a pipeline and run the full sentence (i.e. not broken up into words) through it in a different environment, I get the following result:
I am showing one example here, but this is pretty much consistent across all numbers I test it on. A couple of my thoughts:
- It splits the number into sub-tokens based on punctuation (`,`, `$`, `.`, etc.) but doesn't aggregate them again post-prediction. In the training data generation process I passed in `$4,299,000` as a single token (understanding that the tokenizer would further split it up into sub-tokens). Is there a way to similarly pass word by word into a prediction `pipeline` so that it understands where to aggregate?
- It's pretty consistent in that it will predict `B-<LABEL>` for the first token and `B-<LABEL>` or `I-<LABEL>` for the next tokens up until the first comma, after which it will predict the `O` category with high confidence.
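One thing I checked: the sub-tokens themselves appear to be the same whether the sentence is passed in raw or pre-split on whitespace, which suggests the mismatch is in the label aggregation rather than the tokenization itself. A quick comparison, again with `bert-base-cased` as a stand-in:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Raw sentence, tokenized as one string.
raw = tokenizer.tokenize("The asking price is $4,299,000")

# Same sentence pre-split on whitespace, as in my training data.
split = tokenizer(["The", "asking", "price", "is", "$4,299,000"],
                  is_split_into_words=True)
pieces = tokenizer.convert_ids_to_tokens(split["input_ids"])[1:-1]  # drop [CLS]/[SEP]

print(raw)
print(pieces)
```

In both cases the dollar value ends up as the same punctuation-split pieces; only the word grouping information differs.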
My uninformed guess is that the `$4,` and the `299,000` are getting split up in the pre-processing step of training because of the punctuation, with the latter being assigned no label. I feel like the solution would be to enforce aggregation at the known word, e.g. `$4,299,000`, so that using `aggregation_strategy="first"` it would get assigned the correct label - but I'm not sure if there is an OOTB way to do this?
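What I have in mind is something like the following sketch: pre-split the sentence on whitespace exactly as in training, run the model directly, and take the first sub-token's prediction for each whitespace word (a hand-rolled `aggregation_strategy="first"`, but at the whitespace-word level rather than the tokenizer's word level). The checkpoint here is a public stand-in, and `predict_words` is my own helper, not a `transformers` API:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Public NER checkpoint as a stand-in for my fine-tuned model.
name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

def predict_words(words):
    """Label each whitespace word with its FIRST sub-token's prediction,
    mirroring the training-time alignment (is_split_into_words=True)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(**enc).logits[0].argmax(-1).tolist()
    labels, seen = [], set()
    for pos, wid in enumerate(enc.word_ids()):
        if wid is None or wid in seen:
            continue  # skip special tokens and later sub-tokens of a word
        seen.add(wid)
        labels.append((words[wid], model.config.id2label[pred_ids[pos]]))
    return labels

print(predict_words("The asking price is $4,299,000".split()))
```

This guarantees one label per whitespace word, so `$4,299,000` can never be split across labels - but it feels like re-implementing something the pipeline should be able to do.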
Posting this here because I feel like I'm missing a trick with how this sub-tokenization works, and I'm sure someone has had to deal with punctuation in numbers before. Any ideas welcome!