Why is Trainer single-threaded during "Generating split..."?

I am training a 3B LLM from scratch on 1 million text samples, and it takes half an hour just to get through “Generating split train…”, peaking at roughly 500 samples per second in bursts. I want to scale up to 200 million samples, which the progress estimate puts at about 7 days. If I tokenize and pack the data myself with map(), using num_proc=64 and large batches, the same preprocessing finishes in under 8 hours instead of days (a rough sketch of what I mean is below).
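For context, this is a minimal sketch of the kind of tokenize-and-pack preprocessing I mean; the model name, data file, block size, and batch size are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-3b-model")       # placeholder model name
raw = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder data file

block_size = 2048  # placeholder sequence length


def tokenize(batch):
    # Tokenize raw text; no padding/truncation since we pack afterwards.
    return tokenizer(batch["text"])


def pack(batch):
    # Concatenate all token ids in the batch and cut into fixed-size blocks.
    concatenated = sum(batch["input_ids"], [])
    total = (len(concatenated) // block_size) * block_size
    return {"input_ids": [concatenated[i:i + block_size] for i in range(0, total, block_size)]}


# Both steps run across 64 worker processes with large batches.
tokenized = raw.map(tokenize, batched=True, batch_size=10_000, num_proc=64,
                    remove_columns=raw.column_names)
packed = tokenized.map(pack, batched=True, batch_size=10_000, num_proc=64,
                       remove_columns=tokenized.column_names)
```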

I see that you can pass in sharded files if you use a streaming dataset, but streaming causes other errors in my setup (roughly the pattern shown below). How can I get multiple workers to be used while Trainer or SFTTrainer is “Generating split”?
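This is roughly the sharded streaming pattern I was referring to; the shard paths are placeholders:

```python
from datasets import load_dataset

# Placeholder shard paths. With streaming=True the "Generating split" step is skipped
# entirely, but in my case streaming then triggers other errors later in training.
shards = [f"data/train-{i:05d}.jsonl" for i in range(64)]
streamed = load_dataset("json", data_files=shards, split="train", streaming=True)
```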