I’m trying to train a model based on `esm2_t6_8M_UR50D`, the smallest ESM2 model (6 layers, 8M parameters), using `TrainingArguments` with `per_device_train_batch_size=1`, `gradient_accumulation_steps=1`, and `fp16=True`. I’m training on a cluster with four 32 GB GPUs, and all four are fully utilized. I also set `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` beforehand. Even with this configuration, training runs out of memory at iteration 600 (out of 40k).
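One detail worth double-checking in the setup above: `PYTORCH_CUDA_ALLOC_CONF` only takes effect if it is set before the CUDA context is initialized. A minimal sketch of setting it from inside the training script rather than the shell (the surrounding script structure here is an assumption, not my actual code):

```python
import os

# The allocator option must be set before the first CUDA call
# (e.g. before any torch.cuda use or moving the model to "cuda"),
# otherwise PyTorch silently ignores it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# ... imports of torch/transformers and the Trainer setup would follow here ...
```

Setting it in the script guarantees the ordering regardless of how the job is launched.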
What else can I do to make it work?