WordLevel error: Missing [UNK] token from the vocabulary

Hi, I am trying to train a basic WordLevel tokenizer on a file data.txt containing

5174 5155 4749 4814 4832 4761 4523 4999 4860 4699 5024 4788 [UNK]

When I run my code

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
tokenizer.train(files=['data.txt'])
tokenizer.encode('5155')

I get the error

Exception: WordLevel error: Missing [UNK] token from the vocabulary

Why is the token still missing, even though [UNK] appears in data.txt and I set unk_token='[UNK]'?

Any help is much appreciated!


Hi Athena, I’m having the same issue… did you ever find the root cause of the problem?


I am experiencing this too


Had the same issue… this worked for me: pass a WordLevelTrainer that lists [UNK] in special_tokens.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))

# Specify [UNK] here: the trainer must list it in special_tokens,
# otherwise it never makes it into the vocabulary
trainer = WordLevelTrainer(
    special_tokens=['[UNK]']
)

files = ['./datasets/AAABBBCCC.txt']
tokenizer.train(files, trainer)  # <--- pass the trainer to train()
tokenizer.encode('41').ids
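
For completeness, here is a minimal end-to-end sketch (assuming a whitespace-separated file like the data.txt above). Besides listing [UNK] in the trainer's special_tokens, you usually also want to set a pre-tokenizer such as Whitespace: without one, the model never splits the input, so the literal [UNK] in the training file is not seen as a standalone word.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# The model only names the unknown token; it does not add it to the vocab
tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))

# Split input on whitespace so each number becomes its own word
tokenizer.pre_tokenizer = Whitespace()

# The trainer is what actually inserts [UNK] into the vocabulary
trainer = WordLevelTrainer(special_tokens=['[UNK]'])

tokenizer.train(files=['data.txt'], trainer=trainer)

print(tokenizer.encode('5155').ids)   # known word -> its own id
print(tokenizer.encode('99999').ids)  # unseen word (hypothetical) -> the [UNK] id

With this setup, encode('5155') returns the id of '5155' instead of raising, and anything not seen during training maps to the [UNK] id.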