What exactly is changed in a tokenizer after training it?

Are only the new words it learns added, or does anything else change (e.g., existing words removed, or more)?
I ask because after I trained the tokenizer, new words from the dataset were indeed added, but the performance of the model (which I used with this trained tokenizer) dropped. How can I debug what caused the performance drop (there was no change to the model before and after the tokenizer training)?
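For context, this is the kind of comparison I have in mind: diffing the old and new token-to-id mappings to see which tokens were added, removed, or re-indexed. The vocab dicts below are toy stand-ins; with a real (e.g., Hugging Face) tokenizer the mapping would come from something like `tokenizer.get_vocab()`.

```python
# Sketch: compare two token->id vocab mappings to see what retraining changed.
# The dicts below are made-up examples, not real tokenizer output.
old_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
new_vocab = {"[PAD]": 0, "[UNK]": 1, "world": 2, "dataset": 3}

added = set(new_vocab) - set(old_vocab)    # tokens only in the new vocab
removed = set(old_vocab) - set(new_vocab)  # tokens dropped by retraining
shared = set(old_vocab) & set(new_vocab)
# Shared tokens whose ids moved -- these no longer line up with the
# model's embedding rows, which could explain a performance drop.
reindexed = {t for t in shared if old_vocab[t] != new_vocab[t]}

print(sorted(added))      # tokens gained
print(sorted(removed))    # tokens lost
print(sorted(reindexed))  # tokens whose ids changed
```

With these toy dicts, `"dataset"` is added, `"hello"` is removed, and `"world"` keeps its string but gets a different id, so the model would look up the wrong embedding for it.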