CUDA out of memory - knowledge distillation

My code is available in the google folder below:

I run the file using the jobscript_new_ddp file, with the command: `sbatch jobscript_new_ddp`

Here is my system info:

  • Accelerate version: 0.26.1
  • Platform: Linux-4.18.0-372.57.1.el8_6.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.18
  • Numpy version: 1.26.1
  • PyTorch version (GPU?): 2.1.2+cu121 (False)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 251.38 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: fp16
    • use_cpu: False
    • debug: False
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: all
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env:

I keep getting the CUDA out-of-memory error below, even though I run the code on a very small dataset with two GPUs. I should have more than enough memory.
```
File "/gpfs/home2/ngroot/", line 174, in <module>
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 1539, in train
    return inner_training_loop(
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 1944, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 2291, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 3095, in evaluate
    output = eval_loop(
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 3310, in evaluation_loop
    preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 123, in nested_concat
    return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
File "/home/ngroot/anaconda3/envs/llmke/lib/python3.9/site-packages/transformers/", line 82, in torch_pad_and_concatenate
    return torch.cat((tensor1, tensor2), dim=0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.57 GiB. GPU 0 has a total capacty of 39.39 GiB of which 5.26 GiB is free. Including non-PyTorch memory, this process has 34.12 GiB memory in use. Of the allocated memory 31.64 GiB is allocated by PyTorch, and 1.73 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Does anyone have suggestions on what I am doing wrong and how to fix it?


First thing I can think of: let's clean up this code.

Don't call `accelerator.prepare()` yourself; the `Trainer` already uses Accelerate under the hood. Does the OOM still occur if you skip this step?

(You also don't need to move any models to devices yourself; the `Trainer` handles device placement.)