Hello everyone, I’ve been following HF’s course chapter Fine-tuning a masked language model - Hugging Face Course.
I have a sample training dataset of 20_000 points, preprocessed as required. I’m using datasets to read the data points, and I’m also using a data collator to get dynamic masking for every batch.
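For context, the masking setup follows the course; roughly (simplified, with the course’s default mlm_probability):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# re-masks 15% of tokens every time a batch is collated (dynamic masking)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)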
I am using a g4dn.2xlarge instance (32 GB RAM, 16 GB GPU, 8 vCPUs; see g4dn.2xlarge pricing and specs - Vantage) for fine-tuning roberta-base on the MLM task, with a batch size of 8 and each datapoint 512 tokens long.
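For completeness, the training side is the standard Trainer setup; roughly (simplified; output_dir is a placeholder, and train_dataset is the object loaded further below):

from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

model = AutoModelForMaskedLM.from_pretrained("roberta-base")

training_args = TrainingArguments(
    output_dir="roberta-mlm",           # placeholder
    per_device_train_batch_size=8,      # the batch size of 8 mentioned above
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset["train"],
    data_collator=data_collator,
)
trainer.train()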
With the above config, I observed that GPU memory usage was very high (95%+) while system RAM utilization was only around 13-15%!
I followed Cache management — datasets 1.12.0 documentation and set IN_MEMORY_MAX_SIZE to ~25 GB, but no luck:
import datasets
from datasets import load_dataset

datasets.config.IN_MEMORY_MAX_SIZE = 24_696_061_952  # raise the in-memory threshold (in bytes) before loading
train_dataset = load_dataset('pandas', data_files={'train': 'path to pickle file'}, keep_in_memory=True)  # copy into RAM instead of memory-mapping the Arrow cache
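(If it helps diagnose: my understanding from the docs is that an in-memory Dataset reports an empty cache_files list, so this should show whether keep_in_memory actually took effect.)

print(train_dataset['train'].cache_files)  # expect [] if the dataset was loaded into RAM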
But RAM usage remained the same. How can I make full use of RAM as well as GPU memory?
I’ve taken 20_000 points as the sample for this experiment, but I have ~1 million data points that I will use for full-fledged training once this problem is resolved.
Thanks