Why is Trainer single-threaded during "Generating split..."?

I am training a 3B LLM from scratch on 1 million text samples, and it takes half an hour just to get through “Generating split train…”, peaking at roughly 500 samples per second in bursts. I want to scale up to 200 million samples, which the progress estimate puts at about 7 days. If I tokenize and pack the data myself with map(), using num_proc=64 and large batches, the same preprocessing finishes in under 8 hours instead of days (a rough sketch of what I mean is below).
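For context, this is a minimal sketch of the kind of tokenize-and-pack preprocessing I mean; the model name, data file, block size, and batch size are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-3b-model")       # placeholder model name
raw = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder data file

block_size = 2048  # placeholder sequence length


def tokenize(batch):
    # Tokenize raw text; no padding/truncation since we pack afterwards.
    return tokenizer(batch["text"])


def pack(batch):
    # Concatenate all token ids in the batch and cut into fixed-size blocks.
    concatenated = sum(batch["input_ids"], [])
    total = (len(concatenated) // block_size) * block_size
    return {"input_ids": [concatenated[i:i + block_size] for i in range(0, total, block_size)]}


# Both steps run across 64 worker processes with large batches.
tokenized = raw.map(tokenize, batched=True, batch_size=10_000, num_proc=64,
                    remove_columns=raw.column_names)
packed = tokenized.map(pack, batched=True, batch_size=10_000, num_proc=64,
                       remove_columns=tokenized.column_names)
```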

I see that you can pass in sharded files if you use a streaming dataset, but streaming causes other errors in my setup (roughly the pattern shown below). How can I get multiple workers to be used while Trainer or SFTTrainer is “Generating split”?
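This is roughly the sharded streaming pattern I was referring to; the shard paths are placeholders:

```python
from datasets import load_dataset

# Placeholder shard paths. With streaming=True the "Generating split" step is skipped
# entirely, but in my case streaming then triggers other errors later in training.
shards = [f"data/train-{i:05d}.jsonl" for i in range(64)]
streamed = load_dataset("json", data_files=shards, split="train", streaming=True)
```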