I’m trying to train a model on a large (600GB on disk) dataset of pre-computed embedding vectors.
Training is I/O-bound (loading embeddings from disk). I'm trying to speed up data loading by using multiple DataLoader workers, but I'm running into memory issues.
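For context, my loading setup is roughly the following (a simplified sketch; the path, column names, and batch size are placeholders, not my real values):

```python
import torch
from datasets import load_from_disk
from torch.utils.data import DataLoader

# Simplified sketch of the current map-style setup
ds = load_from_disk("/data/embeddings")                       # ~600GB Arrow dataset on disk
ds = ds.with_format("torch", columns=["embedding", "label"])  # formatted for PyTorch

loader = DataLoader(
    ds,
    batch_size=256,
    shuffle=True,
    num_workers=32,   # more workers = faster epochs, but memory climbs until OOM
    pin_memory=True,
)
```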
Some benchmarks:
Using 32 workers, I get about 3 hr/epoch, but hit an OOM about 45 minutes in
Using 0 workers, I get about 8 hr/epoch with no memory issues
Using any worker count between 1 and 31 slows the rate of memory growth, but doesn't fix the underlying issue
I believe this is caused by PyTorch's DataLoader workers each building up their own copy of dataset memory in-process (link), but I don't know how to stop that. Is there a fix for this, or a way to periodically restart the DataLoader workers to clear their memory?
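To be concrete about the second part of the question, the kind of workaround I have in mind is something like this (just a sketch, not something I've verified works; `step_fn`, the chunk count, and the batch size are placeholders): split each epoch into chunks and rebuild the DataLoader per chunk, so the worker processes are torn down and their accumulated memory released.

```python
import torch
from torch.utils.data import DataLoader, Subset

def train_one_epoch(dataset, step_fn, num_chunks=8, num_workers=32, batch_size=256):
    # Shuffle once per epoch, then iterate the epoch in chunks. Each chunk gets a
    # fresh DataLoader, so its workers exit once the chunk is exhausted
    # (persistent_workers=False), releasing whatever memory they accumulated.
    indices = torch.randperm(len(dataset))
    chunk_size = (len(indices) + num_chunks - 1) // num_chunks
    for c in range(num_chunks):
        chunk = indices[c * chunk_size:(c + 1) * chunk_size].tolist()
        loader = DataLoader(
            Subset(dataset, chunk),
            batch_size=batch_size,
            num_workers=num_workers,
            persistent_workers=False,
        )
        for batch in loader:
            step_fn(batch)  # forward/backward/optimizer step
```

Is something like this reasonable, or is there a built-in option I'm missing?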
Some notes:
Datasets are formatted for PyTorch before training
Changing to an iterable dataset slows the rate of memory accumulation for the same number of workers, but doesn't stop it (see the sketch after these notes)
This is using:
torch==2.2.1
transformers==4.51.3
datasets==2.19.1
accelerate==1.6.0
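The iterable-dataset variant mentioned in the notes is roughly this (again a simplified sketch with placeholder values):

```python
from datasets import load_from_disk
from torch.utils.data import DataLoader

ds = load_from_disk("/data/embeddings")              # same placeholder path as above
iterable_ds = ds.to_iterable_dataset(num_shards=32)  # num_shards >= num_workers
iterable_ds = iterable_ds.with_format("torch")

# No shuffle arg here: shard order is fixed, and each worker streams its own shards.
loader = DataLoader(iterable_ds, batch_size=256, num_workers=32)
```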