Shared Memory in Accelerate

Hey, I have a question about how to use shared memory with Accelerate.

I have been using Accelerate to streamline my multi-GPU training, specifically distributed training across 4 GPUs. However, my dataset is very large (40GB), and when it is copied for each of the 4 GPUs it takes up over 160GB of RAM.

The dataset itself is just a single tensor object that contains the same data on each device. Is there a way to force Accelerate to use a single shared memory location for the dataset so that it only takes 40GB of RAM instead of 160GB?

Not really. That’s why we use Datasets in all our examples: it caches everything on disk, so nothing takes up space in RAM in this kind of distributed training.
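
For reference, a rough, untested sketch of that approach (with a small made-up tensor standing in for the real 40GB one): the data is written once as Arrow files, and every process memory-maps the same files from disk instead of holding its own copy in RAM.

```python
import torch
from datasets import Dataset, load_from_disk

# One-off conversion: write the tensor out as an Arrow-backed dataset.
# (Hypothetical small tensor standing in for the real data.)
data = torch.randn(10_000, 128)
Dataset.from_dict({"features": data.tolist()}).save_to_disk("my_dataset")

# In the training script, every process memory-maps the same Arrow files,
# so the data is read from disk rather than duplicated in each process's RAM.
ds = load_from_disk("my_dataset").with_format("torch")
print(ds[0]["features"].shape)  # torch.Size([128])
```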

One workaround would be to define your datasets as None/empty on all processes except process 0 and use dispatch_batches=True.
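
If I understand the suggestion correctly, the setup would look roughly like this. This is an untested sketch with hypothetical names (`feature_dim`, the stand-in tensor); depending on your Accelerate version, `dispatch_batches` is passed directly to `Accelerator` or via `accelerate.utils.DataLoaderConfiguration`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(dispatch_batches=True)

feature_dim = 128  # hypothetical feature size
if accelerator.is_main_process:
    # Only process 0 holds the real data (small stand-in tensor here).
    big_tensor = torch.randn(100_000, feature_dim)
    dataset = TensorDataset(big_tensor)
else:
    # Empty placeholder with the same column shape on the other processes.
    dataset = TensorDataset(torch.empty(0, feature_dim))

dataloader = accelerator.prepare(DataLoader(dataset, batch_size=32))

for (batch,) in dataloader:
    # With dispatch_batches=True, process 0 reads each batch and Accelerate
    # slices and broadcasts it to the other processes, so they never need
    # the full dataset in memory.
    ...
```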


@sgugger, I have a related question. I am trying to understand how distributed code with Accelerate works. How do I synchronize certain variables across different processes? Is FileStorage the only possible shared memory?

@sgugger Thanks for the response! It has been a while, but I have a follow-up question to this. I did what you suggested and set all of my datasets to be empty except for the one on the main process.

However, I guess I am a bit unsure about the internals of how dispatch_batches actually operates. Since all the other processes have empty datasets, they fly through my training loop and everything becomes out of sync pretty fast.

Is there any way to keep the processes synced up with Accelerate when only one process actually has data and dispatch_batches is used?

It’s kind of a weird use case for the API but I appreciate any advice on this!