Tokenizer splits up pre-split tokens

I am working on a Named Entity Recognition (NER) problem, and I need tokenization to be quite precise in order to match tokens with per-token NER tags. I have the following sentence, "It costs 2.5 million.", which I have already tokenized.

tokens = ['It', 'costs', '2.5', 'million.']

I then run the list through a BERT tokenizer using the is_split_into_words=True option to get input IDs. When I try to reconstruct the original sentence using the tokenizer, I see that it has split the token 2.5 into the three tokens 2, ., and 5. It also split the token million. into two tokens million and ..

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

result = tokenizer(tokens, is_split_into_words=True)

print(result.input_ids)
# [101, 2009, 5366, 1016, 1012, 1019, 2454, 1012, 102]

print(tokenizer.decode(result.input_ids))
# [CLS] it costs 2. 5 million. [SEP]

print(tokenizer.convert_ids_to_tokens(result.input_ids))
# ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']

I do not want that additional tokenization. Since I passed is_split_into_words=True to the tokenizer, I was expecting that the tokenizer would treat each token as atomic and not do any further tokenization. I want the original string to be treated as four tokens ['It', 'costs', '2.5', 'million.'] so that the tokens line up with my NER tags, where 2.5 has an NER tag of number.

How would I go about fixing my problem? Thank you.


Hi facehugger2020,

in order to fix your problem, you will need to train a BERT tokenizer yourself, and then train the BERT model too.

When a BERT model is created and pre-trained, it uses a particular vocabulary. For example, the standard bert-base-uncased model has a vocabulary of roughly 30,000 tokens. "2.5" is not part of that vocabulary, so the BERT tokenizer splits it up into smaller units.

Training from scratch with the vocabulary you need is not impossible, but it will be tricky and probably expensive. Could you force your NER tags to fit with the BERT tokenization? For example, could you perform the tagging after the tokenization?

My training data is already pre-tokenized, and there is a 1-to-1 correspondence between token and NER tag. That's why I don't want the tokens to be split any further. I know that HuggingFace has an NER training example, but it reveals the same problem: tokens are broken down into subword pieces which do not match the NER tags.

Well, '2.5' is not in BERT's vocabulary.
Building a tokenizer is an integral part of building a BERT model. The way the model learns depends on the tokens it uses.

You have 3 options.

  1. Change your pre-processing
  2. Create your own BERT with your own vocabulary
  3. Write some post-processing to re-align the NER tags

Do you have enough data to train a BERT model from scratch?

Do you definitely need to keep all your different numbers, or could you substitute a known token for each number before you pass the text to BERT?
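
Something along these lines might work (just a sketch, untested; the [NUM] placeholder and the regex are arbitrary choices of mine):

import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

NUM_TOKEN = '[NUM]'                      # hypothetical placeholder string
tokenizer.add_tokens([NUM_TOKEN])        # register it so it is not split further
# if you fine-tune, also call model.resize_token_embeddings(len(tokenizer))

tokens = ['It', 'costs', '2.5', 'million.']
masked = [NUM_TOKEN if re.fullmatch(r'\d+(\.\d+)?', t) else t for t in tokens]
# ['It', 'costs', '[NUM]', 'million.']

result = tokenizer(masked, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(result.input_ids))
# '[NUM]' should come through as a single token; 'million.' is still split on the '.'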

Chris McCormick has some nice explanations of tokenization in blog posts and YouTube videos. See this blog post, for example: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

Thank you for your help. I think my only recourse is option (3), writing post-processing code to re-align subword tokens to NER tags. That's the approach taken in the Huggingface NER training example, but it's a bit inelegant (you need to give subword pieces the NER tag -100).
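
For reference, the re-alignment looks roughly like this (a sketch along the lines of the Hugging Face example; the tag names are made up, and in real training the labels would be integer IDs):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

tokens = ['It', 'costs', '2.5', 'million.']
ner_tags = ['O', 'O', 'number', 'O']          # hypothetical per-word tags

encoding = tokenizer(tokens, is_split_into_words=True)

aligned = []
previous_word = None
for word_id in encoding.word_ids():
    if word_id is None:                        # [CLS] / [SEP]
        aligned.append(-100)
    elif word_id != previous_word:             # first sub-token of a word keeps the tag
        aligned.append(ner_tags[word_id])
    else:                                      # remaining sub-tokens are ignored in the loss
        aligned.append(-100)
    previous_word = word_id

print(list(zip(encoding.tokens(), aligned)))
# Only the first piece of '2.5' keeps the 'number' tag; '.' and '5' get -100.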

I already read all the McCormick tutorials on BERT. Thank you.


Hi

I have the same issue, but the problem is not that the BERT tokenizer splits some pre-split tokens; the problem is that it doesn't show the new split in the output: there's no ## at the beginning of '.' and '5' in the given example, which prevents a simple post-processing. I don't understand this behavior, and how one could expect a correct evaluation when it's impossible to retrieve the original tokens…

I am facing the same problem, any solution up to date?

Hi,
I found a solution: you can use "word_ids" to realign the new tokens and the original ones. It's a method of the transformers.BatchEncoding class, which you can call e.g. like this:

tokenized_inputs = tokenizer(tokens, truncation=True, is_split_into_words=True)
word_ids = tokenized_inputs.word_ids()

Then you have access to a list of indices that indicates which tokens come from the same word. My understanding is that BERT-like tokenizers (1) tokenize based on punctuation (I think?) and (2) 'sub-tokenize' into subwords. They both make sense, in a way, but part (1) is less documented (especially the fact that it happens even with the option saying that the input is already tokenized), there's nothing like ## to indicate the split, and I've had a hard time understanding what was going on.
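
For the example at the top of this thread, the mapping should look like this (assuming bert-base-uncased with use_fast=True):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
tokens = ['It', 'costs', '2.5', 'million.']

encoding = tokenizer(tokens, is_split_into_words=True)

print(encoding.tokens())
# ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']
print(encoding.word_ids())
# [None, 0, 1, 2, 2, 2, 3, 3, None]  -> '2', '.', '5' all map back to word 2 ('2.5')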

It seems that 'word_ids' only works with "Fast" tokenizers, as indicated here: Tokenizer

I use it as a post-processing step. It still bothers me that the model doesn't learn on "my" input tokenisation, while at the end it is evaluated on it…

I hope it helps!


Hi,
The "word_ids" approach doesn't work for me: 2.5, which is split into 2, ., and 5, gets different word_ids for each piece, so there is no way to identify whether they belong to a single entity. I am also in search of a solution.

Did you guys find any solution to this?

I'm trying to "recombine" the embeddings of split words too.
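
In case it helps, here is a rough sketch of one way to do that with word_ids and mean pooling (my own untested sketch, assuming a PyTorch model; any pooling strategy would work):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
model = AutoModel.from_pretrained('bert-base-uncased')

tokens = ['It', 'costs', '2.5', 'million.']
encoding = tokenizer(tokens, is_split_into_words=True, return_tensors='pt')

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]    # (seq_len, hidden_size)

word_ids = encoding.word_ids()
word_embeddings = []
for word_index in range(len(tokens)):
    # positions of all sub-tokens belonging to this pre-split word
    piece_positions = [i for i, w in enumerate(word_ids) if w == word_index]
    word_embeddings.append(hidden[piece_positions].mean(dim=0))

word_embeddings = torch.stack(word_embeddings)          # one vector per original word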