Unigram vocab_size doesn't fit

Hi there, thanks for your amazing work!

I’m running into an issue when using Unigram and UnigramTrainer.
I want to build a vocabulary of size 10 (based on a vocabulary previously built with BPE).

But I don’t understand why, whether I use train_from_iterator (with a list or with an iterator) or train (from a file containing the same data as the list), I can’t get a vocab of the desired size.

Here is some code to work with:

# tokenizers.__version__ == 0.13.2
from tokenizers import Tokenizer
from tokenizers.models import Unigram

corpus = ["table", "bleu", "cable"]

vocab = ["a", "b", "c", "e", "l", "t", "u", "bl", "ble", "able", "cable"]
tokenizer = Tokenizer(Unigram())
tokenizer.add_tokens(vocab)
print("With added tokens")
print(tokenizer.get_vocab(with_added_tokens=True))
print(tokenizer.get_vocab_size(with_added_tokens=True))
print("Without added tokens")
print(tokenizer.get_vocab(with_added_tokens=False))
print(tokenizer.get_vocab_size(with_added_tokens=False))

from tokenizers.trainers import UnigramTrainer
trainer = UnigramTrainer(vocab_size=10)#, initial_alphabet=["a", "b", "c", "e", "l", "t", "u"])
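
# Helpers for the commented-out alternatives below (my assumption: corpus.txt
# and iterator() simply carry the same three words as the `corpus` list above).
with open("corpus.txt", "w") as f:
    f.write("\n".join(corpus))

def iterator():
    for word in corpus:
        yield word
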
tokenizer.train_from_iterator(corpus, trainer=trainer)
#tokenizer.train(["corpus.txt"], trainer=trainer)
#tokenizer.train_from_iterator(iterator(), trainer=trainer)

print("With added tokens")
print(dict(sorted(tokenizer.get_vocab(with_added_tokens=True).items(), key=lambda item: item[1])))
print(tokenizer.get_vocab_size(with_added_tokens=True))

print("Without added tokens")
print(dict(sorted(tokenizer.get_vocab(with_added_tokens=False).items(), key=lambda item: item[1])))
print(tokenizer.get_vocab_size(with_added_tokens=False))

And the associated results: