For training, we load multiple datasets (each consisting of 1000+ Arrow files of 1GB+, saved with datasets) via load_from_disk() / load_dataset(…, streaming=True), and then interleave them with interleave_datasets().
The training (Seq2SeqTrainer) runs on 8 GPUs on 1 node using accelerate + DeepSpeed. Because the training dataset is very large, this fails with a CPU OOM.
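Roughly, the loading/interleaving part looks like the sketch below (dataset paths, sampling probabilities, and the seed are placeholders, not our real values):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder directories of Arrow files; the real setup has many more datasets.
data_dirs = ["/data/corpus_a", "/data/corpus_b", "/data/corpus_c"]

# Each dataset is 1000+ Arrow files of 1GB+, so we load them as streaming
# IterableDatasets instead of materializing them in memory.
streams = [
    load_dataset("arrow", data_files=f"{d}/*.arrow", split="train", streaming=True)
    for d in data_dirs
]

# Mix the corpora; probabilities/seed here are placeholders.
train_dataset = interleave_datasets(streams, probabilities=[0.4, 0.4, 0.2], seed=42)

# train_dataset is then handed to Seq2SeqTrainer and launched with
# accelerate (8 GPUs, 1 node) + DeepSpeed.
```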
Is my understanding correct:
- Accelerate shards the final interleaved dataset (the output of interleave_datasets()) based on the GPU/process index
- There is no communication between the per-GPU Accelerate processes for the data loading/distribution part
We were able to alleviate the memory issues by sharding the initial datasets before interleaving (a rough sketch of what we did is below).
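This is a simplified sketch of that workaround for the map-style (load_from_disk) case; the paths are placeholders and the real code has more datasets:

```python
import os
from datasets import load_from_disk, interleave_datasets

# Process index/world size as set by `accelerate launch` (assumption: the usual
# RANK / WORLD_SIZE environment variables are available).
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

paths = ["/data/corpus_a", "/data/corpus_b"]  # placeholder paths

# Shard each source dataset down to this rank's slice *before* interleaving,
# so no single process ever iterates over the full interleaved dataset.
shards = [
    load_from_disk(p).shard(num_shards=world_size, index=rank, contiguous=True)
    for p in paths
]
train_dataset = interleave_datasets(shards, seed=42)
```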
So, is it possible to shard the dataset ourselves and have Accelerate skip its own further sharding? If that is not possible, is there any recommendation for addressing the memory issue?
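For the streaming case, what we have in mind is something like the sketch below, using datasets.distributed.split_dataset_by_node to assign each rank its slice up front; whether the Trainer/Accelerate would then avoid sharding on top of this is exactly what we are unsure about (paths are placeholders):

```python
import os
from datasets import load_dataset, interleave_datasets
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

streams = [
    load_dataset("arrow", data_files=f"{d}/*.arrow", split="train", streaming=True)
    for d in ["/data/corpus_a", "/data/corpus_b"]  # placeholder paths
]
mixed = interleave_datasets(streams, seed=42)

# Give each rank its own slice of the interleaved stream before training starts.
train_dataset = split_dataset_by_node(mixed, rank=rank, world_size=world_size)
```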