I am trying to train my own SentencePiece-style unigram tokenizer, but the EM training step is taking roughly 2x the predicted number of iterations to prune the vocabulary (see the progress output below). Is this normal, and should I retry with a larger shrinking factor or vocabulary size? Any hints? Thanks!
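For context, here is a minimal sketch of how the trainer is set up. The corpus path, vocabulary size, and special tokens are placeholders, and I am assuming the HuggingFace `tokenizers` UnigramTrainer API (which is what prints the progress bars below); `vocab_size` and `shrinking_factor` are the two knobs I am asking about:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Unigram model trained from scratch.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=32_000,        # placeholder target size after pruning
    shrinking_factor=0.75,    # fraction of the vocab kept after each pruning round
    unk_token="<unk>",
    special_tokens=["<unk>"],
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("unigram.json")
```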
[02:22:46] Pre-processing sequences █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 0 / 0
[00:11:52] Suffix array seeds ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 43102066 / 43102066
[1d 00:50:16] EM training ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 72 / 34
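For what it's worth, the "predicted" count above is just the 34-step total the EM progress bar reported. A back-of-the-envelope way to estimate the number of pruning rounds, assuming each round keeps `shrinking_factor` of the current vocabulary until the target size is reached (seed and target sizes here are hypothetical, and this is my own arithmetic, not the library's internal computation), would be:

```python
import math

# Rough estimate only: going from a seed vocabulary down to the target size,
# keeping `shrinking_factor` of the pieces each round, takes about
# log(target / seed) / log(shrinking_factor) rounds.
seed_vocab = 1_000_000       # hypothetical seed vocabulary size
target_vocab = 32_000        # hypothetical target vocabulary size
shrinking_factor = 0.75

rounds = math.ceil(math.log(target_vocab / seed_vocab) / math.log(shrinking_factor))
print(rounds)  # -> 12 under these hypothetical numbers
```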