I understand how a SentencePiece model is trained in the monolingual case, but the multilingual case is not clear to me, because dataset sizes vary greatly across languages. I suspect this leads to a biased shared vocabulary.
- Does SentencePiece training also use a sampling technique to rebalance languages? (See the sketch after this list for the kind of sampling I have in mind.)
- If so, how many samples are drawn?
- Wouldn't it be better to go through all the text in the dataset to build the sub-word vocabulary, instead of using only samples?
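To make the question concrete, here is a minimal sketch of the kind of rebalancing I mean: temperature-based sampling, where each language is drawn with probability proportional to `(n_i / N)^alpha` before the combined corpus is fed to the SentencePiece trainer (the scheme used in XLM-style multilingual pretraining). The file paths, `alpha=0.3`, the sentence budget, and the vocabulary size are all assumptions for illustration, not values from any particular model.

```python
import random
import sentencepiece as spm

# Hypothetical per-language corpora (one sentence per line).
corpora = {"en": "en.txt", "hi": "hi.txt", "sw": "sw.txt"}
alpha = 0.3            # assumed temperature; lower values upweight low-resource languages
budget = 1_000_000     # assumed size of the combined training sample

# Count sentences per language.
sizes = {}
for lang, path in corpora.items():
    with open(path, encoding="utf-8") as f:
        sizes[lang] = sum(1 for _ in f)

# Exponentiated sampling probabilities: p_i ∝ (n_i / N)^alpha.
n_total = sum(sizes.values())
weights = {lang: (n / n_total) ** alpha for lang, n in sizes.items()}
z = sum(weights.values())
probs = {lang: w / z for lang, w in weights.items()}

# Build a rebalanced corpus; low-resource languages may be upsampled
# past their true size, so sample with replacement.
with open("multilingual_sample.txt", "w", encoding="utf-8") as out:
    for lang, path in corpora.items():
        k = int(probs[lang] * budget)
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        for line in random.choices(lines, k=k):
            out.write(line + "\n")

# Train a shared vocabulary on the rebalanced sample. Note that SentencePiece
# itself can also subsample its input via input_sentence_size /
# shuffle_input_sentence, which is part of what I'm asking about.
spm.SentencePieceTrainer.train(
    input="multilingual_sample.txt",
    model_prefix="shared_unigram",
    vocab_size=32_000,      # assumed shared vocab size
    model_type="unigram",
    input_sentence_size=budget,
    shuffle_input_sentence=True,
)
```

Is something like this external rebalancing expected, or is SentencePiece's internal subsampling supposed to handle the imbalance on its own?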