I am working on a Named Entity Recognition (NER) problem, and I need tokenization to be quite precise in order to match tokens with per-token NER tags. I have the following sentence
It costs 2.5 million., which I have already tokenized:

```python
tokens = ['It', 'costs', '2.5', 'million.']
```
I then run the list through a BERT tokenizer using the
is_split_into_words=True option to get input IDs. When I try to reconstruct the original sentence using the tokenizer, I see that it has split the token
2.5 into the three tokens 2, ., and 5. It also split the token
million. into the two tokens million and .
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
result = tokenizer(tokens, is_split_into_words=True)

print(result.input_ids)
# [101, 2009, 5366, 1016, 1012, 1019, 2454, 1012, 102]

print(tokenizer.decode(result.input_ids))
# [CLS] it costs 2. 5 million. [SEP]

print(tokenizer.convert_ids_to_tokens(result.input_ids))
# ['[CLS]', 'it', 'costs', '2', '.', '5', 'million', '.', '[SEP]']
```
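For what it's worth, the fast tokenizer's word_ids() method makes the extra splitting visible: it maps each subword back to the index of the word it came from, and indices 2 and 3 each appear more than once (this is my own diagnostic sketch, not part of the original code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
tokens = ['It', 'costs', '2.5', 'million.']
result = tokenizer(tokens, is_split_into_words=True)

# word_ids() returns one entry per subword token; None marks the
# special tokens [CLS] and [SEP]. Word index 2 ('2.5') appears three
# times and word index 3 ('million.') appears twice.
print(result.word_ids())
# [None, 0, 1, 2, 2, 2, 3, 3, None]
```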
I do not want that additional tokenization. Since I passed
is_split_into_words=True to the tokenizer, I was expecting that the tokenizer would treat each token as atomic and not do any further tokenization. I want the original string to be treated as four tokens
['It', 'costs', '2.5', 'million.'] so that the tokens line up with my NER tags, where
2.5 carries a single NER tag.
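To make the goal concrete, this is the one-tag-per-token alignment I'm after (the tag names below are hypothetical placeholders, just to illustrate the pairing):

```python
tokens = ['It', 'costs', '2.5', 'million.']
ner_tags = ['O', 'O', 'B-MONEY', 'I-MONEY']  # hypothetical tags for illustration

# Exactly one tag per token, so the two lists zip cleanly.
assert len(tokens) == len(ner_tags)
for tok, tag in zip(tokens, ner_tags):
    print(f'{tok}\t{tag}')
```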
How would I go about fixing this? Thank you.