I am working on a Named Entity Recognition (NER) problem, and I need tokenization to be quite precise in order to match tokens with per-token NER tags. I have the following sentence
It costs 2.5 million., which I have already tokenized:

```python
tokens = ['It', 'costs', '2.5', 'million.']
```
I then run the list through a BERT tokenizer using the
is_split_into_words=True option to get input IDs. When I try to reconstruct the original sentence using the tokenizer, I see that it has split the token
2.5 into the three tokens 2, ., and 5. It also split the token
million. into the two tokens million and .
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
result = tokenizer(tokens, is_split_into_words=True)

print(result.input_ids)
# [101, 2009, 5366, 1016, 1012, 1019, 2454, 1012, 102]

print(tokenizer.decode(result.input_ids))
# [CLS] it costs 2. 5 million. [SEP]

print(tokenizer.convert_ids_to_tokens(result.input_ids))
# ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']
```
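For what it's worth, the fast tokenizer's word_ids() method makes the extra splitting visible: it maps each subword back to the index of the word it came from, and indices 2 and 3 each appear more than once (this is my own diagnostic sketch, not part of the original code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
tokens = ['It', 'costs', '2.5', 'million.']
result = tokenizer(tokens, is_split_into_words=True)

# word_ids() returns one entry per subword token; None marks the
# special tokens [CLS] and [SEP]. Word index 2 ('2.5') appears three
# times and word index 3 ('million.') appears twice.
print(result.word_ids())
# [None, 0, 1, 2, 2, 2, 3, 3, None]
```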
I do not want that additional tokenization. Since I passed
is_split_into_words=True to the tokenizer, I was expecting that the tokenizer would treat each token as atomic and not do any further tokenization. I want the original string to be treated as four tokens
['It', 'costs', '2.5', 'million.'] so that the tokens line up with my NER tags, where
2.5 carries a single NER tag.
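To make the goal concrete, this is the one-tag-per-token alignment I'm after (the tag names below are hypothetical placeholders, just to illustrate the pairing):

```python
tokens = ['It', 'costs', '2.5', 'million.']
ner_tags = ['O', 'O', 'B-MONEY', 'I-MONEY']  # hypothetical tags for illustration

# Exactly one tag per token, so the two lists zip cleanly.
assert len(tokens) == len(ner_tags)
for tok, tag in zip(tokens, ner_tags):
    print(f'{tok}\t{tag}')
```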
How would I go about fixing this? Thank you.