DistilBert tokenization does not work as expected

Hi.

I’m enriching the DistilBert tokenizer with new tokens from a new corpus. DistilBert uses a WordPiece tokenizer, and according to the Huggingface NLP course, inference works by finding the longest token in the vocabulary that matches the beginning of the word, splitting it off, and repeating the same procedure on the rest of the word.

My tokenizer’s vocabulary, however, contains the tokens inspect, insp, ec, ##ec and ##t, yet when tokenizing inspect it comes up with the following tokens: ['insp', 'ec', '##t'].

I would expect the tokenizer to return only one token: 'inspect'. Even if it splits, I would expect it to return at least ['insp', '##ec', '##t'].
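To make my expectation concrete, here is a rough sketch of the longest-match-first procedure as I understand it from the course (my own toy illustration with a made-up function name, not the actual Hugging Face implementation):

def wordpiece_longest_match(word, vocab, unk_token="[UNK]"):
    # Greedy longest-match-first WordPiece as described in the course:
    # take the longest vocabulary entry matching the start of the word,
    # split it off, and repeat on the remainder (continuations get "##").
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"inspect", "insp", "ec", "##ec", "##t"}
print(wordpiece_longest_match("inspect", vocab))               # ['inspect']
print(wordpiece_longest_match("inspect", vocab - {"inspect"})) # ['insp', '##ec', '##t']

Under this procedure I would never get a bare 'ec' in the middle of a word, which is why the actual output surprises me.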

Is this a bug, or is some part of my code incorrect?

Minimal working example:

>> from transformers import AutoTokenizer

>> model_checkpoint = 'elastic/distilbert-base-uncased-finetuned-conll03-english'
>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, False, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'inspect', '[SEP]']

>> tokenizer.add_tokens(['insp'])
# 1
>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, True, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'insp', 'ec', '##t', '[SEP]']