Low RAM usage & high GPU usage, Datasets not helping

Hello everyone, I’ve been following HF’s tutorial “Fine-tuning a masked language model” from the Hugging Face Course.

I have a sample training dataset of 20,000 points, preprocessed as needed. I’m using datasets to read the data points, and a data collate function to apply dynamic masking to every batch (see the sketch below).
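
For context, the dynamic-masking setup from the course looks roughly like this (a minimal sketch; roberta-base and the 15% masking probability follow the course, everything else is illustrative):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Masks are re-sampled every time a batch is collated, so each epoch
# sees a different masking pattern (dynamic masking)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)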

I am using a g4dn.2xlarge instance (32 GB RAM, 16 GB GPU, 8 vCPUs) to fine-tune roberta-base on the MLM task, with a batch size of 8 and a sequence length of 512 per data point.
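
For reference, a minimal Trainer setup matching this configuration could look like the following (a sketch, not my exact script; output_dir is illustrative and fp16 is an optional assumption for the T4 GPU on g4dn instances):

from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

model = AutoModelForMaskedLM.from_pretrained("roberta-base")

training_args = TrainingArguments(
    output_dir="roberta-mlm",        # illustrative path
    per_device_train_batch_size=8,   # the batch size mentioned above
    fp16=True,                       # assumption: mixed precision on the T4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset["train"],  # loaded as shown below
    data_collator=data_collator,           # the dynamic-masking collator above
)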

With the above config, I observed that GPU memory usage was very high (95%+) while system RAM utilization was only around 13-15%!

I followed the Cache management page of the datasets (v1.12.0) documentation and set IN_MEMORY_MAX_SIZE to ~25 GB, but no luck:

import datasets
from datasets import load_dataset

# keep_in_memory=True loads the dataset fully into RAM (capped by IN_MEMORY_MAX_SIZE)
datasets.config.IN_MEMORY_MAX_SIZE = 24_696_061_952
train_dataset = load_dataset('pandas', data_files={'train': 'path to pickle file'}, keep_in_memory=True)

But RAM usage stayed the same. How can I fully utilize both RAM and GPU memory?

I’ve taken 20,000 points as the sample for this experiment, but I have ~1 million data points that I’ll use for full-fledged training once this problem is resolved.

Thanks

How big is your dataset in bytes? By adding keep_in_memory=True you load it completely into memory.

20,000 is rather small: 20,000 points * 512 tokens * 4 bytes ~= 41MB

Thanks for the response.

Yes, you’re right, it’s 41 MB. My total dataset is ~7 GB.

With a 16 GB GPU, the maximum batch size I can fit is 8. If I wanted a batch size of 16, I’d need a 32+ GB GPU, but such an instance also comes with much more main memory (RAM), around 64 GB or 128 GB, of which I’d use only a fraction while fully utilizing the GPU.
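
(As an aside, one common workaround here, not something from this thread, is gradient accumulation, which reaches an effective batch size of 16 on the same 16 GB GPU. A minimal sketch with the Trainer API; output_dir is illustrative:)

from transformers import TrainingArguments

# Effective batch size = 8 (per device) * 2 (accumulation steps) = 16,
# without needing a larger GPU
training_args = TrainingArguments(
    output_dir="roberta-mlm",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
)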

This is the first time I’m training an MLM, so I’m not sure whether this is usually the case.

Thanks again.

You don’t need to fill up your RAM to get the best data loading throughput while training. If your dataset fits in RAM you’re already all set ^^
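
In fact, you can usually drop keep_in_memory=True altogether: datasets memory-maps its Arrow cache from disk, so RAM usage stays low while batch loading stays fast. A minimal sketch (same illustrative path as above):

from datasets import load_dataset

# Without keep_in_memory, the Arrow cache is memory-mapped from disk:
# low RAM usage, still fast data loading
train_dataset = load_dataset('pandas', data_files={'train': 'path to pickle file'})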
