I have previously fine-tuned an NER/token classification model successfully on a custom dataset (roughly based on this tutorial). However, in attempting to shift to a new dataset I have come across an issue.
The new feature of the dataset is that it contains numbers/dollar values, e.g. of the form $123,000/1,000, $1,000,000 etc. The model is failing to classify these in full, and it's made me realize I am probably fundamentally misunderstanding something about the sub-tokenization process. I was hoping someone could shed some light for me.
Here are a few example rows from my dataset:
You will note that I have already split my sentences/documents into words (this was done simply on whitespace, FYI), so when generating my training data I am using `is_split_into_words` in the tokenizer.
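For context, the label alignment that usually goes with `is_split_into_words=True` can be sketched as below. This is a minimal sketch and not necessarily what the training script here does: `align_labels`, `label_all_subtokens` and the example `word_ids` are hypothetical, and the `word_ids` list only imitates what a fast tokenizer's `BatchEncoding.word_ids()` might return if `$4,299,000` were split into six sub-tokens.

```python
IGNORE_INDEX = -100  # label id conventionally ignored by the cross-entropy loss


def align_labels(word_ids, word_labels, label_all_subtokens=False):
    """Expand one label per word into one label per sub-token.

    word_ids: one entry per sub-token, None for special tokens.
    word_labels: one integer label id per whitespace-split word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS]/[SEP] carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First sub-token of a word always gets the word's label.
            aligned.append(word_labels[wid])
        else:
            # Later sub-tokens are either labelled too or masked out.
            aligned.append(word_labels[wid] if label_all_subtokens else IGNORE_INDEX)
        previous = wid
    return aligned


# Hypothetical example: words = ["asking", "$4,299,000"],
# label ids 0 = O and 1 = B-asking_price (assumed mapping).
example_word_ids = [None, 0, 1, 1, 1, 1, 1, 1, None]
example_word_labels = [0, 1]

first_only = align_labels(example_word_ids, example_word_labels)
all_subtokens = align_labels(example_word_ids, example_word_labels, label_all_subtokens=True)
```

If only the first sub-token is labelled during training, the model is never penalised for what it predicts on the later sub-tokens, so its outputs there at inference time may not be meaningful.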
The model trains well - 0.87 F1 score on the `B-asking_price` tag I am highlighting here.
However, after I export the model as a pipeline and try to run the full sentence (i.e. not broken up into words) through it in a different environment, I get the following result:
I am showing one example here, but this is pretty much consistent across all numbers I test it on. A couple of my thoughts:
- It splits the number into sub-tokens based on punctuation (commas, periods, etc.) but doesn't aggregate them again post-prediction. In the training data generation process I passed in `$4,299,000` as a single token (understanding that the tokenizer would further split it up into sub-tokens). Is there a way to similarly pass word by word into a prediction `pipeline` so that it understands where to aggregate?
- It's pretty consistent in that it will predict `B-<LABEL>` for the first token and `I-<LABEL>` for the next tokens up until the first comma, after which it will predict the `O` category with high confidence.
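One way to get word-level predictions without the pipeline is to tokenize the whitespace-split words with `is_split_into_words=True` at inference time as well, and keep only the prediction of each word's first sub-token. The sketch below shows just that last step in plain Python. The function name `word_level_labels` is hypothetical, and the `word_ids` and per-token labels are made-up values imitating the behaviour described above (in practice `word_ids` would come from the fast tokenizer's `BatchEncoding.word_ids()` and the labels from the model's argmax).

```python
def word_level_labels(word_ids, token_labels):
    """Return one label per word, taken from the word's first sub-token."""
    first_label = {}
    for wid, label in zip(word_ids, token_labels):
        # Skip special tokens, and keep only the first sub-token seen per word.
        if wid is not None and wid not in first_label:
            first_label[wid] = label
    return [first_label[wid] for wid in sorted(first_label)]


# Hypothetical values for words = ["asking", "$4,299,000"]:
# B- on "$", I- on "4", then O from the first comma onwards.
example_word_ids = [None, 0, 1, 1, 1, 1, 1, 1, None]
example_token_labels = [
    "O",                 # special token
    "O",                 # asking
    "B-asking_price",    # $
    "I-asking_price",    # 4
    "O", "O", "O", "O",  # , 299 , 000
    "O",                 # special token
]

labels_per_word = word_level_labels(example_word_ids, example_token_labels)
```

With this approach the `O` predictions after the first comma are simply ignored, because the whole of `$4,299,000` inherits the label of its first sub-token.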
My uninformed guess is that the `$4,` and the `299,000` are getting split up in the pre-processing step of training because of the punctuation, with the latter being assigned no label? I feel like the solution would be to enforce aggregation at the known word, e.g. “$4,299,000”, so that with `aggregation_strategy=first` it would get assigned the correct label - but I'm not sure if there is an OOTB way to do this?
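If the pipeline's built-in aggregation does not treat `$4,299,000` as one word, a possible workaround is to re-aggregate its raw per-token output by whitespace. The sketch below assumes per-token dicts with `entity`, `score`, `start` and `end` keys, which is the shape the token-classification pipeline returns with `aggregation_strategy="none"`. The function `aggregate_by_whitespace` and the example predictions are hypothetical, not pipeline features or real model output.

```python
import re


def aggregate_by_whitespace(text, token_preds):
    """Assign each whitespace-delimited word the label of its first sub-token.

    Words whose first sub-token has no prediction (e.g. it was dropped as "O")
    are left out of the result.
    """
    results = []
    for match in re.finditer(r"\S+", text):
        # Look for a token prediction that starts exactly where the word starts.
        first = next((p for p in token_preds if p["start"] == match.start()), None)
        if first is None or first["entity"] == "O":
            continue
        results.append({
            "word": match.group(),
            "entity": first["entity"],
            "score": first["score"],
            "start": match.start(),
            "end": match.end(),
        })
    return results


# Hypothetical per-token predictions: only "$" and "4" were tagged.
example_text = "Asking $4,299,000 today"
example_preds = [
    {"entity": "B-asking_price", "score": 0.98, "start": 7, "end": 8},
    {"entity": "I-asking_price", "score": 0.95, "start": 8, "end": 9},
]

entities = aggregate_by_whitespace(example_text, example_preds)
```

This mirrors the "first" strategy but uses the same whitespace split as the training data, so the word boundaries at prediction time match the ones the labels were defined on.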
Posting this here because I feel like I am missing a trick with how this sub-tokenization works, and I'm sure someone has had to deal with punctuation in numbers before. Any ideas welcome.