I am trying to train my own SentencePiece-style unigram tokenizer, but the EM training step is taking roughly 2x the predicted number of iterations to prune the vocabulary (see the progress output below). Is this normal, and should I retry with a larger shrinking factor or vocabulary size? Any hints? Thanks!
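For context, here is a minimal sketch of how the trainer is set up. The corpus path, vocabulary size, and special tokens are placeholders, and I am assuming the HuggingFace `tokenizers` UnigramTrainer API (which is what prints the progress bars below); `vocab_size` and `shrinking_factor` are the two knobs I am asking about:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Unigram model trained from scratch.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=32_000,        # placeholder target size after pruning
    shrinking_factor=0.75,    # fraction of the vocab kept after each pruning round
    unk_token="<unk>",
    special_tokens=["<unk>"],
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("unigram.json")
```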
[02:22:46] Pre-processing sequences █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 0 / 0
[00:11:52] Suffix array seeds ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 43102066 / 43102066
[1d 00:50:16] EM training ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 72 / 34
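For what it's worth, the "predicted" count above is just the 34-step total the EM progress bar reported. A back-of-the-envelope way to estimate the number of pruning rounds, assuming each round keeps `shrinking_factor` of the current vocabulary until the target size is reached (seed and target sizes here are hypothetical, and this is my own arithmetic, not the library's internal computation), would be:

```python
import math

# Rough estimate only: going from a seed vocabulary down to the target size,
# keeping `shrinking_factor` of the pieces each round, takes about
# log(target / seed) / log(shrinking_factor) rounds.
seed_vocab = 1_000_000       # hypothetical seed vocabulary size
target_vocab = 32_000        # hypothetical target vocabulary size
shrinking_factor = 0.75

rounds = math.ceil(math.log(target_vocab / seed_vocab) / math.log(shrinking_factor))
print(rounds)  # -> 12 under these hypothetical numbers
```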