WordLevel error: Missing [UNK] token from the vocabulary

Hi, I am trying to train a basic WordLevel tokenizer on a file data.txt containing

5174 5155 4749 4814 4832 4761 4523 4999 4860 4699 5024 4788 [UNK]

When I run my code

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
tokenizer.train(files=['data.txt'])
tokenizer.encode('5155')

I get the error

Exception: WordLevel error: Missing [UNK] token from the vocabulary

Why is the token still missing, even though [UNK] appears in data.txt and I set unk_token='[UNK]'?

Any help is much appreciated!


Hi Athena, I’m having the same issue… did you ever find the root cause of the problem?


I am experiencing this too


Had the same issue… this worked for me: pass a WordLevelTrainer that lists [UNK] in special_tokens.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))

# Specify [UNK] here: the trainer must list it in special_tokens,
# otherwise it never makes it into the vocabulary
trainer = WordLevelTrainer(
    special_tokens=['[UNK]']
)

files = ['./datasets/AAABBBCCC.txt']
tokenizer.train(files, trainer)  # <--- pass the trainer to train()
tokenizer.encode('41').ids
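
For completeness, here is a minimal end-to-end sketch (assuming a whitespace-separated file like the data.txt above). Besides listing [UNK] in the trainer's special_tokens, you usually also want to set a pre-tokenizer such as Whitespace: without one, the model never splits the input, so the literal [UNK] in the training file is not seen as a standalone word.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# The model only names the unknown token; it does not add it to the vocab
tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))

# Split input on whitespace so each number becomes its own word
tokenizer.pre_tokenizer = Whitespace()

# The trainer is what actually inserts [UNK] into the vocabulary
trainer = WordLevelTrainer(special_tokens=['[UNK]'])

tokenizer.train(files=['data.txt'], trainer=trainer)

print(tokenizer.encode('5155').ids)   # known word -> its own id
print(tokenizer.encode('99999').ids)  # unseen word (hypothetical) -> the [UNK] id

With this setup, encode('5155') returns the id of '5155' instead of raising, and anything not seen during training maps to the [UNK] id.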