Hi there, thanks for your amazing work!
I'm encountering an issue when using Unigram and UnigramTrainer.
I want to build a vocabulary of size 10 (based on a vocabulary previously built with BPE), but whether I use train_from_iterator (with a list or with an iterator) or train (with a file containing the same data), I can't obtain a vocabulary of the desired size, and I don't understand why.
Here is some code to work with:
corpus = ["table", "bleu", "cable"]
# tokenizers.__version__ == 0.13.2
from tokenizers import Tokenizer
from tokenizers.models import Unigram
vocab = ["a", "b", "c", "e", "l", "t", "u", "bl", "ble", "able", "cable"]
tokenizer = Tokenizer(Unigram())
tokenizer.add_tokens(vocab)
print("With added tokens")
print(tokenizer.get_vocab(with_added_tokens=True))
print(tokenizer.get_vocab_size(with_added_tokens=True))
print("Without added tokens")
print(tokenizer.get_vocab(with_added_tokens=False))
print(tokenizer.get_vocab_size(with_added_tokens=False))
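As far as I understand, add_tokens registers entries in the "added tokens" table on top of the model rather than in the Unigram model's own vocabulary, which is why the two counts above differ. For comparison, here is a sketch that builds the model vocabulary directly from (token, score) pairs; the scores below are arbitrary placeholders, not values produced by training:

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram

# Unigram accepts a list of (token, log-probability) pairs.
# These scores are made up for illustration only.
scored_vocab = [
    ("a", -1.0), ("b", -1.0), ("c", -1.0), ("e", -1.0),
    ("l", -1.0), ("t", -1.0), ("u", -1.0),
    ("bl", -2.0), ("ble", -2.0), ("able", -2.0), ("cable", -2.0),
]
tokenizer = Tokenizer(Unigram(scored_vocab))

# All 11 entries now live in the model vocabulary itself.
print(tokenizer.get_vocab_size(with_added_tokens=False))  # 11
```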
from tokenizers.trainers import UnigramTrainer
trainer = UnigramTrainer(vocab_size=10)#, initial_alphabet=["a", "b", "c", "e", "l", "t", "u"])
tokenizer.train_from_iterator(corpus, trainer=trainer)
#tokenizer.train(["corpus.txt"], trainer=trainer)
#tokenizer.train_from_iterator(iterator(), trainer=trainer)
print("With added tokens")
print(dict(sorted(tokenizer.get_vocab(with_added_tokens=True).items(), key=lambda item: item[1])))
print(tokenizer.get_vocab_size(with_added_tokens=True))
print("Without added tokens")
print(dict(sorted(tokenizer.get_vocab(with_added_tokens=False).items(), key=lambda item: item[1])))
print(tokenizer.get_vocab_size(with_added_tokens=False))
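For completeness, the two commented-out variants above refer to a corpus.txt file and an iterator() helper; here is a self-contained sketch of how I set those up (assuming they carry exactly the same three words as the list):

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

corpus = ["table", "bleu", "cable"]

# Write the same data to a file for the train(...) variant.
with open("corpus.txt", "w") as f:
    f.write("\n".join(corpus) + "\n")

def iterator():
    # Generator yielding one line at a time, for train_from_iterator.
    for line in corpus:
        yield line

tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(vocab_size=10)
tokenizer.train(["corpus.txt"], trainer=trainer)
# or: tokenizer.train_from_iterator(iterator(), trainer=trainer)

print(tokenizer.get_vocab_size(with_added_tokens=False))
```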
And the associated results: