Tokenizer splits up pre-split tokens

I am working on a Named Entity Recognition (NER) problem, and I need tokenization to be quite precise in order to match tokens with per-token NER tags. I have the following sentence, "It costs 2.5 million.", which I have already tokenized.

tokens = ['It', 'costs', '2.5', 'million.']

I then run the list through a BERT tokenizer using the is_split_into_words=True option to get input IDs. When I try to reconstruct the original sentence using the tokenizer, I see that it has split the token 2.5 into the three tokens 2, ., and 5. It also split the token million. into two tokens million and ..

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

result = tokenizer(tokens, is_split_into_words=True)

print(result.input_ids)
# [101, 2009, 5366, 1016, 1012, 1019, 2454, 1012, 102]

print(tokenizer.decode(result.input_ids))
# [CLS] it costs 2. 5 million. [SEP]

print(tokenizer.convert_ids_to_tokens(result.input_ids))
# ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']

I do not want that additional tokenization. Since I passed is_split_into_words=True to the tokenizer, I was expecting that the tokenizer would treat each token as atomic and not do any further tokenization. I want the original string to be treated as four tokens ['It', 'costs', '2.5', 'million.'] so that the tokens line up with my NER tags, where 2.5 has an NER tag of number.

How would I go about fixing my problem? Thank you.


Hi facehugger2020,

in order to fix your problem, you will need to train a BERT tokenizer yourself, and then train the BERT model too.

When a BERT model is created and pre-trained, it uses a particular vocabulary. For example, the standard bert-base-uncased model has a vocabulary of roughly 30,000 tokens. "2.5" is not part of that vocabulary, so the BERT tokenizer splits it up into smaller units.

Training from scratch with the vocabulary you need is not impossible, but it will be tricky and probably expensive. Could you force your NER tags to fit with the BERT tokenization? For example, could you perform the tagging after the tokenization?

My training data is already pre-tokenized, and there is a 1-to-1 correspondence between token and NER tag. That's why I don't want the tokens to be split any further. I know that HuggingFace has an NER training example, but it reveals the same problem: tokens are broken down into subword pieces which do not match the NER tags.

Well, '2.5' is not in BERT's vocabulary.
Building a tokenizer is an integral part of building a BERT model. The way the model learns depends on the tokens it uses.

You have 3 options.

  1. Change your pre-processing
  2. Create your own BERT with your own vocabulary
  3. Write some post-processing to re-align the NER tags

Do you have enough data to train a BERT model from scratch?

Do you definitely need to keep all your different numbers, or could you substitute a known token for each number before you pass the text to BERT?
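
Something along these lines might work (just a sketch, untested; the [NUM] placeholder and the regex are arbitrary choices of mine):

import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

NUM_TOKEN = '[NUM]'                      # hypothetical placeholder string
tokenizer.add_tokens([NUM_TOKEN])        # register it so it is not split further
# if you fine-tune, also call model.resize_token_embeddings(len(tokenizer))

tokens = ['It', 'costs', '2.5', 'million.']
masked = [NUM_TOKEN if re.fullmatch(r'\d+(\.\d+)?', t) else t for t in tokens]
# ['It', 'costs', '[NUM]', 'million.']

result = tokenizer(masked, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(result.input_ids))
# '[NUM]' should come through as a single token; 'million.' is still split on the '.'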

Chris McCormick has some nice explanations of tokenization in blog posts and YouTube videos. See this blog post, for example: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

Thank you for your help. I think my only recourse is option (3), writing post-processing code to re-align subword tokens to NER tags. That's the approach taken in the Huggingface NER training example, but it's a bit inelegant (you need to give subword pieces the NER tag -100).
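
For reference, the re-alignment looks roughly like this (a sketch along the lines of the Hugging Face example; the tag names are made up, and in real training the labels would be integer IDs):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

tokens = ['It', 'costs', '2.5', 'million.']
ner_tags = ['O', 'O', 'number', 'O']          # hypothetical per-word tags

encoding = tokenizer(tokens, is_split_into_words=True)

aligned = []
previous_word = None
for word_id in encoding.word_ids():
    if word_id is None:                        # [CLS] / [SEP]
        aligned.append(-100)
    elif word_id != previous_word:             # first sub-token of a word keeps the tag
        aligned.append(ner_tags[word_id])
    else:                                      # remaining sub-tokens are ignored in the loss
        aligned.append(-100)
    previous_word = word_id

print(list(zip(encoding.tokens(), aligned)))
# Only the first piece of '2.5' keeps the 'number' tag; '.' and '5' get -100.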

I already read all the McCormick tutorials on BERT. Thank you.


Hi

I have the same issue, but the problem is not that the BERT tokenizer splits some pre-split tokens; the problem is that it doesn't show the new split in the output: there's no ## at the beginning of '.' and '5' in the given example, which prevents a simple post-processing. I don't understand this behavior, and how one could expect a correct evaluation when it's impossible to retrieve the original tokens…

I am facing the same problem, any solution up to date?

Hi,
I found a solution: you can use "word_ids" to realign the new tokens and the original ones. It's a method of the transformers.BatchEncoding class, which you can call e.g. like this:

tokenized_inputs = tokenizer(tokens, truncation=True, is_split_into_words=True)
word_ids = tokenized_inputs.word_ids()

Then you have access to a list of indices that indicates which tokens come from the same word. My understanding is that BERT-like tokenizers (1) tokenize based on punctuation (I think?) and (2) 'sub-tokenize' into subwords. They both make sense, in a way, but part (1) is less documented (especially the fact that it happens even with the option saying that the input is already tokenized), there's nothing like ## to indicate the split, and I've had a hard time understanding what was going on.
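
For the example at the top of this thread, the mapping should look like this (assuming bert-base-uncased with use_fast=True):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
tokens = ['It', 'costs', '2.5', 'million.']

encoding = tokenizer(tokens, is_split_into_words=True)

print(encoding.tokens())
# ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']
print(encoding.word_ids())
# [None, 0, 1, 2, 2, 2, 3, 3, None]  -> '2', '.', '5' all map back to word 2 ('2.5')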

It seems that 'word_ids' only works with "Fast" tokenizers, as indicated here: Tokenizer

I use it as a post-processing step. It still bothers me that the model doesn't learn on "my" input tokenisation, while at the end it is evaluated on it…

I hope it helps!


Hi,
The "word_ids" approach doesn't work for me: 2.5, which is split into 2, ., and 5, gets different word_ids for each piece, so there is no way to identify whether they belong to a single entity. I am also in search of a solution.

Did you guys find any solution to this?

I'm trying to "recombine" the embeddings of split words too.
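
In case it helps, here is a rough sketch of one way to do that with word_ids and mean pooling (my own untested sketch, assuming a PyTorch model; any pooling strategy would work):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
model = AutoModel.from_pretrained('bert-base-uncased')

tokens = ['It', 'costs', '2.5', 'million.']
encoding = tokenizer(tokens, is_split_into_words=True, return_tensors='pt')

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]    # (seq_len, hidden_size)

word_ids = encoding.word_ids()
word_embeddings = []
for word_index in range(len(tokens)):
    # positions of all sub-tokens belonging to this pre-split word
    piece_positions = [i for i, w in enumerate(word_ids) if w == word_index]
    word_embeddings.append(hidden[piece_positions].mean(dim=0))

word_embeddings = torch.stack(word_embeddings)          # one vector per original word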