Tokenizer splits up pre-split tokens

I am working on a Named Entity Recognition (NER) problem, and I need tokenization to be quite precise in order to match tokens with per-token NER tags. I have the following sentence, "It costs 2.5 million.", which I have already tokenized.

tokens = ['It', 'costs', '2.5', 'million.']

I then run the list through a BERT tokenizer using the is_split_into_words=True option to get input IDs. When I try to reconstruct the original sentence using the tokenizer, I see that it has split the token 2.5 into the three tokens 2, ., and 5. It also split the token million. into two tokens million and ..

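from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

result = tokenizer(tokens, is_split_into_words=True)

print(result['input_ids'])
# [101, 2009, 5366, 1016, 1012, 1019, 2454, 1012, 102]

print(tokenizer.decode(result['input_ids']))
# [CLS] it costs 2. 5 million. [SEP]

print(tokenizer.convert_ids_to_tokens(result['input_ids']))
# ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']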

I do not want that additional tokenization. Since I passed is_split_into_words=True to the tokenizer, I was expecting that the tokenizer would treat each token as atomic and not do any further tokenization. I want the original string to be treated as four tokens ['It', 'costs', '2.5', 'million.'] so that the tokens line up with my NER tags, where 2.5 has an NER tag of number.

How would I go about fixing my problem? Thank you.

Hi facehugger2020,

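In order to fix your problem, you will need to train a BERT tokenizer yourself, and then train the BERT model too.

When a BERT model is created and pre-trained, it uses a particular vocabulary. For example, the standard bert-base-uncased model has a vocabulary of about 30,000 tokens (30,522, to be exact). “2.5” is not part of that vocabulary, so the BERT tokenizer splits it up into smaller units. The is_split_into_words=True option only tells the tokenizer that the input is already split into words; each word is still broken down into WordPiece units from the vocabulary.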

Training from scratch with the vocabulary you need is not impossible, but it will be tricky and probably expensive. Could you force your NER tags to fit with the BERT tokenization? For example, could you perform the tagging after the tokenization?

My training data is already pre-tokenized, and there is a 1-to-1 correspondence between token and NER tag. That’s why I don’t want the tokens to be split any further. I know that Hugging Face has an NER training example, but it reveals the same problem: tokens are broken down into subword pieces which do not match the NER tags.

Well, ‘2.5’ is not in BERT’s vocabulary.
Building a tokenizer is an integral part of building a BERT model. The way the model learns depends on the tokens it uses.

You have 3 options.

  1. Change your pre-processing
  2. Create your own BERT with your own vocabulary
  3. Write some post-processing to re-align the NER tags
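For option 3 at prediction time, one possible approach is to keep only the prediction made for the first subword piece of each original token. The sketch below assumes a fast tokenizer, whose result.word_ids() reports which original token each subword piece came from; the function name and the predictions are made up for illustration.

```python
from typing import List, Optional


def word_level_predictions(word_ids: List[Optional[int]],
                           subword_preds: List[str]) -> List[str]:
    """Keep the prediction of the first subword piece of each original token."""
    preds = []
    previous = None
    for word_id, pred in zip(word_ids, subword_preds):
        # Skip special tokens (None) and later pieces of the same token.
        if word_id is not None and word_id != previous:
            preds.append(pred)
        previous = word_id
    return preds


# word_ids for ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 2, 2, 3, 3, None]
# Made-up per-subword model output, one tag per subword piece.
subword_preds = ['O', 'O', 'O', 'NUMBER', 'O', 'NUMBER', 'O', 'O', 'O']

words = word_level_predictions(word_ids, subword_preds)
print(words)  # ['O', 'O', 'NUMBER', 'O']
```

This gives back one tag per original token, so the output lines up with the four pre-split tokens again.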

Do you have enough data to train a BERT model from scratch?

Do you definitely need to keep all your different numbers, or could you substitute a known token for each number before you pass the text to BERT?
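As a rough sketch of that substitution idea: replace every purely numeric token with a fixed placeholder before tokenizing. The placeholder word "number" and the regular expression are assumptions for illustration; you would want to check that the placeholder you choose is a single entry in the model's vocabulary.

```python
import re
from typing import List

# Matches tokens such as '2', '2.5' or '1,000.50' (an illustrative pattern).
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")
PLACEHOLDER = "number"  # hypothetical placeholder, not a special BERT token


def mask_numbers(tokens: List[str], placeholder: str = PLACEHOLDER) -> List[str]:
    """Replace purely numeric tokens with a placeholder, keeping the list length."""
    return [placeholder if NUMBER_PATTERN.fullmatch(tok) else tok for tok in tokens]


tokens = ['It', 'costs', '2.5', 'million.']
masked = mask_numbers(tokens)
print(masked)  # ['It', 'costs', 'number', 'million.']
```

The list keeps its length, so the 1-to-1 correspondence with the NER tags is preserved. Note that this does nothing for tokens like 'million.', which would still be split at the trailing period.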

Chris McCormick has some nice explanations of tokenization, in blogs and YouTube videos. See this blog for example

Thank you for your help. I think my only recourse is option (3), writing post-processing code to re-align subword tokens to NER tags. That’s the approach taken in the Hugging Face NER training example, but it’s a bit inelegant (you need to give the NER tag of -100 to subword pieces).
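For anyone finding this later, here is a minimal sketch of that re-alignment, not the exact code from the Hugging Face example. It assumes a fast tokenizer, whose result.word_ids() returns the index of the original token for each subword piece (None for special tokens). The label ids are hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Label the first subword piece of each token; mark everything else as ignored."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:          # special tokens such as [CLS] and [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:    # first subword piece of a token
            aligned.append(word_labels[word_id])
        else:                        # later subword pieces of the same token
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# word_ids for ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 2, 2, 3, 3, None]
word_labels = [0, 0, 1, 0]  # hypothetical ids: 0 = O, 1 = NUMBER

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, 0, 1, -100, -100, 0, -100, -100]
```

With this, only the first piece of each original token contributes to the loss, so the four NER tags still map onto the four original tokens.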

I already read all the McCormick tutorials on BERT. Thank you.