Hi, I am trying to train a basic word-level tokenizer from a file data.txt
containing the single line
5174 5155 4749 4814 4832 4761 4523 4999 4860 4699 5024 4788 [UNK]
When I run my code
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
tokenizer.train(files=['data.txt'])
tokenizer.encode('5155')
I get the error
Exception: WordLevel error: Missing [UNK] token from the vocabulary
Why is [UNK] still reported as missing, even though it appears in data.txt and I explicitly set unk_token='[UNK]'?
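For completeness, here is a self-contained version of my repro: it recreates data.txt itself and catches the exception so the script runs to the end and prints the error (the try/except is only there for the repro; my real code just calls encode directly).

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Recreate data.txt with the exact line shown above
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("5174 5155 4749 4814 4832 4761 4523 4999 4860 4699 5024 4788 [UNK]\n")

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.train(files=["data.txt"])

err = None
try:
    tokenizer.encode("5155")
except Exception as e:
    err = e
print(err)
```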
Any help is much appreciated!