CUDA out of memory - knowledge distillation

My code is available in the google folder below:

I run the file using the jobscript_new_ddp file, with the command: `sbatch jobscript_new_ddp`

Here is my system info:

  • Accelerate version: 0.26.1
  • Platform: Linux-4.18.0-372.57.1.el8_6.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.18
  • Numpy version: 1.26.1
  • PyTorch version (GPU?): 2.1.2+cu121 (False)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 251.38 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: fp16
    • use_cpu: False
    • debug: False
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: all
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env:

I keep getting the CUDA out-of-memory error below, even though I run the code on a very small dataset with two GPUs. I should have more than enough memory.
```
File "/gpfs/home2/ngroot/", line 174, in <module>
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 1539, in train
    return inner_training_loop(
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 1944, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 2291, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 3095, in evaluate
    output = eval_loop(
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 3310, in evaluation_loop
    preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 123, in nested_concat
    return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 82, in torch_pad_and_concatenate
    return torch.cat((tensor1, tensor2), dim=0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.57 GiB. GPU 0 has a total capacty of 39.39 GiB of which 5.26 GiB is free. Including non-PyTorch memory, this process has 34.12 GiB memory in use. Of the allocated memory 31.64 GiB is allocated by PyTorch, and 1.73 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Does anyone have suggestions on what I am doing wrong and how to fix it?


First thing I can think of: let's clean up this code.

Don't call `accelerator.prepare()` yourself; the `Trainer` already uses Accelerate under the hood. Does the OOM still occur if you skip this step?

(You also don't need to move any models to devices yourself; the `Trainer` handles device placement.)