I am training a tokenizer from scratch on a large dataset. Is there a way to take advantage of parallelism in the tokenizer trainer?
I also have this question! I am training several tokenizers from scratch for a research project, and it is quite time-consuming — ~15 hours to count the pairs for 4.5G of text. What resources would be most helpful to request for a training job? Is there a way to take advantage of parallelism, as @astein0 asks?
Same question here. I have come up with a PySpark implementation of minbpe that gives you the merges and vocabulary for a BPE tokenizer. It still needs some more work to optimize the speed, but it works decently. Ping me if anyone wants to collaborate on it.
In the meantime, it would be nice to have an official implementation, either in the tokenizers library or in datatrove. @lhoestq
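For anyone curious, here is a minimal sketch of how the pair-counting step of BPE can be distributed with PySpark. This is only an illustration under my own assumptions (the corpus.txt path, the whitespace pre-tokenization, and the count_pairs helper are all made up), not the implementation mentioned above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bpe-pair-count").getOrCreate()
sc = spark.sparkContext

def count_pairs(word):
    # Emit every adjacent byte pair of one whitespace-separated word.
    symbols = list(word.encode("utf-8"))
    for left, right in zip(symbols, symbols[1:]):
        yield ((left, right), 1)

pair_counts = (
    sc.textFile("corpus.txt")               # one line of raw text per record (hypothetical path)
    .flatMap(lambda line: line.split())     # crude whitespace pre-tokenization
    .flatMap(count_pairs)                   # ((left, right), 1) for each adjacent pair
    .reduceByKey(lambda a, b: a + b)        # aggregate counts across partitions
)

# The most frequent pair is the next merge candidate in a BPE training loop.
best_pair, best_count = pair_counts.max(key=lambda kv: kv[1])
print(best_pair, best_count)

A full trainer has to repeat this count after applying each merge, which is where most of the speed optimization work lies.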
It turns out that setting TOKENIZERS_PARALLELISM=true solved my problem.
If you use any form of multiprocessing after importing the tokenizers package, the library switches its parallelism off (to avoid deadlocks after a fork).
So doing something like this ensures that the tokenizers will keep using parallelism:
# Multiprocessing elsewhere in the pipeline, e.g. a datasets .map with num_proc > 1
dataset = dataset.map(map_fn, batched=True, batch_size=64, num_proc=128)

# Set the environment variable to "true" and only then import the tokenizers modules
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"

from tokenizers import (
    Regex,
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    trainers,
)

tokenizer = Tokenizer(models.Unigram())
It worked for me.
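For completeness, here is a minimal sketch of what the actual training call can look like once parallelism is enabled. The UnigramTrainer settings (vocab_size, special tokens) and the corpus.txt iterator are placeholders, not values from the posts above:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"  # must be set before the import below

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Placeholder training configuration
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<unk>", "<pad>"],
    unk_token="<unk>",
)

def text_iterator(path="corpus.txt"):
    # Yield raw text; swap in batches from your dataset column instead.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

# The Rust-side counting runs multithreaded when TOKENIZERS_PARALLELISM is "true".
tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("unigram-tokenizer.json")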