For training, we load multiple datasets (each consisting of 1000+ Arrow files of 1GB+, saved with datasets) via load_from_disk() / load_dataset(…, streaming=True), and then interleave them with interleave_datasets().
The training (Seq2SeqTrainer) runs on 8 GPUs on 1 node using accelerate + DeepSpeed. Because the training dataset is very large, this fails with a CPU OOM.
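Roughly, the loading/interleaving part looks like the sketch below (dataset paths, sampling probabilities, and the seed are placeholders, not our real values):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder directories of Arrow files; the real setup has many more datasets.
data_dirs = ["/data/corpus_a", "/data/corpus_b", "/data/corpus_c"]

# Each dataset is 1000+ Arrow files of 1GB+, so we load them as streaming
# IterableDatasets instead of materializing them in memory.
streams = [
    load_dataset("arrow", data_files=f"{d}/*.arrow", split="train", streaming=True)
    for d in data_dirs
]

# Mix the corpora; probabilities/seed here are placeholders.
train_dataset = interleave_datasets(streams, probabilities=[0.4, 0.4, 0.2], seed=42)

# train_dataset is then handed to Seq2SeqTrainer and launched with
# accelerate (8 GPUs, 1 node) + DeepSpeed.
```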
Is my understanding correct:
- Accelerate shards the final interleaved dataset (the output of interleave_datasets()) based on the GPU/process index
- There is no communication between the per-GPU Accelerate processes for the data loading/distribution part
We were able to alleviate the memory issues by sharding the initial datasets before interleaving (a rough sketch of what we did is below).
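This is a simplified sketch of that workaround for the map-style (load_from_disk) case; the paths are placeholders and the real code has more datasets:

```python
import os
from datasets import load_from_disk, interleave_datasets

# Process index/world size as set by `accelerate launch` (assumption: the usual
# RANK / WORLD_SIZE environment variables are available).
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

paths = ["/data/corpus_a", "/data/corpus_b"]  # placeholder paths

# Shard each source dataset down to this rank's slice *before* interleaving,
# so no single process ever iterates over the full interleaved dataset.
shards = [
    load_from_disk(p).shard(num_shards=world_size, index=rank, contiguous=True)
    for p in paths
]
train_dataset = interleave_datasets(shards, seed=42)
```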
So, is it possible to shard the dataset ourselves and have Accelerate skip its own further sharding? If that is not possible, is there any recommendation for addressing the memory issue?
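For the streaming case, what we have in mind is something like the sketch below, using datasets.distributed.split_dataset_by_node to assign each rank its slice up front; whether the Trainer/Accelerate would then avoid sharding on top of this is exactly what we are unsure about (paths are placeholders):

```python
import os
from datasets import load_dataset, interleave_datasets
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

streams = [
    load_dataset("arrow", data_files=f"{d}/*.arrow", split="train", streaming=True)
    for d in ["/data/corpus_a", "/data/corpus_b"]  # placeholder paths
]
mixed = interleave_datasets(streams, seed=42)

# Give each rank its own slice of the interleaved stream before training starts.
train_dataset = split_dataset_by_node(mixed, rank=rank, world_size=world_size)
```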