Speed up tokenizer training

I am training a tokenizer from scratch on a large dataset. Is there a way to take advantage of parallelism in the tokenizer trainer?

I also have this question! I am training several tokenizers from scratch for a research project, and it is quite time-consuming: about 15 hours just to count the pairs for 4.5 GB of text. What resources would be most helpful to request for a training job? Is there a way to take advantage of parallelism, as @astein0 asks?

Same question here. I have come up with a PySpark implementation of minbpe that produces the merges and vocabulary for a BPE tokenizer. It still needs some work to optimize the speed, but it works decently. Ping me if anyone wants to collaborate on it.
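
Roughly, the expensive pair-counting step looks like this in PySpark (a simplified sketch of the idea rather than the full implementation; corpus.txt, the whitespace pre-tokenization, and the single counting pass are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bpe-pair-counts").getOrCreate()
lines = spark.sparkContext.textFile("corpus.txt")

def pairs_in_line(line):
    # whitespace pre-tokenization, then adjacent symbol pairs within each word
    for word in line.split():
        symbols = list(word)
        for a, b in zip(symbols, symbols[1:]):
            yield ((a, b), 1)

# count pair frequencies across the whole corpus in parallel
pair_counts = lines.flatMap(pairs_in_line).reduceByKey(lambda x, y: x + y)

# the most frequent pair becomes the next BPE merge rule
best_pair, best_count = pair_counts.max(key=lambda kv: kv[1])

The full version runs this in a loop: apply the best merge to the corpus, recount, and repeat until the target vocabulary size is reached.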

In the meantime, it would be nice to have an official implementation, either in the tokenizers library or in datatrove. @lhoestq

It turns out that setting TOKENIZERS_PARALLELISM=true solved my problem. 🙂

If you use any form of multiprocessing after importing the tokenizers package, the library turns its internal parallelism off (to avoid deadlocks after forking).

So doing something like this ensures that the tokenizers library keeps using parallelism:

import os

# `dataset` and `map_fn` are assumed to be defined earlier; dataset.map with
# num_proc > 1 spawns worker processes, i.e. multiprocessing is used here
dataset = dataset.map(map_fn, batched=True, batch_size=64, num_proc=128)

# set the environment variable to "true" and only then import the tokenizer
# modules, so parallelism stays enabled despite the multiprocessing above
os.environ["TOKENIZERS_PARALLELISM"] = "true"
from tokenizers import (
    decoders,
    models,
    pre_tokenizers,
    normalizers,
    trainers,
    Tokenizer,
    Regex,
)

tokenizer = Tokenizer(models.Unigram())

It worked for me.
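
For reference, the training step is what then runs in parallel. A minimal sketch of how the rest of the setup could look, assuming texts is an in-memory iterable of strings and using placeholder values for the vocabulary size and special tokens:

tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=32000,                   # placeholder value
    special_tokens=["<unk>", "<pad>"],  # placeholder tokens
    unk_token="<unk>",
)

# with TOKENIZERS_PARALLELISM=true, training should use all available cores
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("unigram-tokenizer.json")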