Low RAM usage & high GPU usage, Datasets not helping

Hello everyone, I’ve been following HF’s tutorial “Fine-tuning a masked language model” from the Hugging Face Course.

I have a sample training dataset of 20,000 points, preprocessed as needed. I’m using datasets to read the data points, and a data collate function to apply dynamic masking to every batch (see the sketch below).
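
For context, the dynamic-masking setup from the course looks roughly like this (a minimal sketch; roberta-base and the 15% masking probability follow the course, everything else is illustrative):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Masks are re-sampled every time a batch is collated, so each epoch
# sees a different masking pattern (dynamic masking)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)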

I am using a g4dn.2xlarge instance (32 GB RAM, 16 GB GPU, 8 vCPUs) to fine-tune roberta-base on the MLM task, with a batch size of 8 and a sequence length of 512 per data point.
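
For reference, a minimal Trainer setup matching this configuration could look like the following (a sketch, not my exact script; output_dir is illustrative and fp16 is an optional assumption for the T4 GPU on g4dn instances):

from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

model = AutoModelForMaskedLM.from_pretrained("roberta-base")

training_args = TrainingArguments(
    output_dir="roberta-mlm",        # illustrative path
    per_device_train_batch_size=8,   # the batch size mentioned above
    fp16=True,                       # assumption: mixed precision on the T4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset["train"],  # loaded as shown below
    data_collator=data_collator,           # the dynamic-masking collator above
)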

With the above config, I observed that GPU memory usage was very high (95%+) while system RAM utilization was only around 13-15%!

I followed the Cache management page of the datasets (v1.12.0) documentation and set IN_MEMORY_MAX_SIZE to ~25 GB, but no luck:

import datasets
from datasets import load_dataset

# keep_in_memory=True loads the dataset fully into RAM (capped by IN_MEMORY_MAX_SIZE)
datasets.config.IN_MEMORY_MAX_SIZE = 24_696_061_952
train_dataset = load_dataset('pandas', data_files={'train': 'path to pickle file'}, keep_in_memory=True)

But RAM usage stayed the same. How can I fully utilize both RAM and GPU memory?

I’ve taken 20,000 points as the sample for this experiment, but I have ~1 million data points that I’ll use for full-fledged training once this problem is resolved.

Thanks

How big is your dataset in bytes? By adding keep_in_memory=True you load it completely into memory.

20,000 is rather small: 20,000 points * 512 tokens * 4 bytes ~= 41MB

Thanks for the response.

Yes, you’re right, it’s 41 MB. My total dataset is ~7 GB.

With a 16 GB GPU, the maximum batch size I can fit is 8. If I wanted a batch size of 16, I’d need a 32+ GB GPU, but such an instance also comes with much more main memory (RAM), around 64 GB or 128 GB, of which I’d use only a fraction while fully utilizing the GPU.
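
(As an aside, one common workaround here, not something from this thread, is gradient accumulation, which reaches an effective batch size of 16 on the same 16 GB GPU. A minimal sketch with the Trainer API; output_dir is illustrative:)

from transformers import TrainingArguments

# Effective batch size = 8 (per device) * 2 (accumulation steps) = 16,
# without needing a larger GPU
training_args = TrainingArguments(
    output_dir="roberta-mlm",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
)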

This is the first time I’m training an MLM, so I’m not sure whether this is usually the case.

Thanks again.

You don’t need to fill up your RAM to get the best data loading throughput while training. If your dataset fits in RAM you’re already all set ^^
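
In fact, you can usually drop keep_in_memory=True altogether: datasets memory-maps its Arrow cache from disk, so RAM usage stays low while batch loading stays fast. A minimal sketch (same illustrative path as above):

from datasets import load_dataset

# Without keep_in_memory, the Arrow cache is memory-mapped from disk:
# low RAM usage, still fast data loading
train_dataset = load_dataset('pandas', data_files={'train': 'path to pickle file'})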
