Speed up tokenizer training

I am training a tokenizer from scratch on a large dataset. Is there a way to take advantage of parallelism in the tokenizer trainer?


I also have this question! I am training several tokenizers from scratch for a research project, and it is quite time-consuming: roughly 15 hours just to count the pairs for 4.5 GB of text. What resources would be most helpful to request for a training job? Is there a way to take advantage of parallelism, as @astein0 asks? For reference, my setup looks roughly like the sketch below.
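Here is a minimal sketch of the kind of training run in question, assuming the Hugging Face `tokenizers` library with a BPE model; the file names, vocabulary size, and thread count are placeholders, and the environment variables are my guess at the relevant knobs (the Rust backend uses Rayon for its thread pool), not a confirmed answer:

```python
import os

# Assumption: the Rust core parallelizes pair counting via Rayon, so these
# environment variables should control the thread pool. Set them before the
# library is imported so the pool is created with the right size.
os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ["RAYON_NUM_THREADS"] = "16"  # match the CPU cores requested for the job

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a fresh BPE tokenizer and train it from scratch on a local corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])

# train() streams the files and counts pairs in the Rust backend.
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```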