CUDA OOM with deepspeed - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 47.40 GiB of which 209.12 MiB is free

Hello,

I have already searched Google, but nothing I found helped.

When I run my script with "deepspeed test_deepspeed.py" I get the following errors.

When I watch with nvidia-smi, I see only my two processes, one per GPU.

How can I handle this CUDA OOM error? When I run with llama-3.1-8b, everything works fine.

My server has two RTX 6000 Ada GPUs, 512 GB RAM, and 2x AMD EPYC 9124 (16C/32T). I think this is quite enough for offloading while fine-tuning.
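
A quick back-of-the-envelope check I did on the raw weight sizes (my own rough numbers, assuming ~70B and ~8B parameters and 1 byte per parameter in int8):

# Rough estimate only: int8 stores about one byte per parameter,
# so the raw weights alone come out to roughly:
params_70b = 70e9
params_8b = 8e9
print(f"Llama-3.3-70B int8 weights: ~{params_70b / 1024**3:.0f} GiB")  # ~65 GiB
print(f"Llama-3.1-8B int8 weights:  ~{params_8b / 1024**3:.0f} GiB")   # ~7 GiB
print("Per GPU: 47.40 GiB usable, 2 x RTX 6000 Ada = ~96 GiB total")

So the 8B model fits comfortably on one card, while the 70B weights alone are bigger than a single 48 GiB GPU, which is why I expected the ZeRO-3 CPU offload to take care of it.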

Can you give me a hint where my error is? My idea is to quantize the model to 8-bit and then run the fine-tune. In the end it should work as a translator.

Thanks a lot guys! Appreciate your help.


[2024-12-14 09:35:04,046] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Allocated GPU memory: 0.00 GiB
Reserved GPU memory: 0.00 GiB
[2024-12-14 09:35:04,672] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
low_cpu_mem_usage was None, now default to True since model is quantized.
Loading checkpoint shards: 0%| | 0/30 [00:00<?, ?it/s]Allocated GPU memory: 0.00 GiB
Reserved GPU memory: 0.00 GiB
low_cpu_mem_usage was None, now default to True since model is quantized.
Loading checkpoint shards: 33%|█████████████████████████████████████████████████████████▋ | 10/30 [00:19<00:38, 1.93s/it]
Traceback (most recent call last):
File "/root/scripts/test_deepspeed.py", line 51, in
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4264, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4777, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 944, in _load_state_dict_into_meta_model
hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
File "/root/venv/lib/python3.11/site-packages/transformers/quantizers/quantizer_bnb_8bit.py", line 226, in create_quantized_param
new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs).to(target_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 626, in to
return self.cuda(device)
^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 589, in cuda
CB, SCB, _ = bnb.functional.int8_vectorwise_quant(B)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/bitsandbytes/functional.py", line 2777, in int8_vectorwise_quant
out_row = torch.empty(A.shape, device=A.device, dtype=torch.int8)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 47.40 GiB of which 209.12 MiB is free. Including non-PyTorch memory, this process has 24.66 GiB memory in use. Process 36676 has 22.52 GiB memory in use. Of the allocated memory 24.14 GiB is allocated by PyTorch, and 112.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation)
Loading checkpoint shards: 30%|████████████████████████████████████████████████████▏ | 9/30 [00:18<00:44, 2.10s/it]
Traceback (most recent call last):
File "/root/scripts/test_deepspeed.py", line 51, in
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4264, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4777, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 944, in _load_state_dict_into_meta_model
hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
File "/root/venv/lib/python3.11/site-packages/transformers/quantizers/quantizer_bnb_8bit.py", line 226, in create_quantized_param
new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs).to(target_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 626, in to
return self.cuda(device)
^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 589, in cuda
CB, SCB, _ = bnb.functional.int8_vectorwise_quant(B)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/bitsandbytes/functional.py", line 2777, in int8_vectorwise_quant
out_row = torch.empty(A.shape, device=A.device, dtype=torch.int8)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 47.40 GiB of which 209.12 MiB is free. Process 36675 has 24.66 GiB memory in use. Including non-PyTorch memory, this process has 22.52 GiB memory in use. Of the allocated memory 21.97 GiB is allocated by PyTorch, and 137.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation)


DeepSpeed config (test_ds_config.json):
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": {
    "enabled": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0
}
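
For reference, my (possibly wrong) understanding of the transformers DeepSpeed integration is that ZeRO-3 can only shard the weights while from_pretrained is loading them if the DeepSpeed config is already registered at that point, i.e. when the TrainingArguments with deepspeed=... are constructed before the model is loaded. Below is a minimal sketch of that ordering with my config file; I left the BitsAndBytesConfig out of the sketch because I am not sure bitsandbytes 8-bit quantization can be combined with ZeRO-3 parameter partitioning at all:

# Sketch only: build TrainingArguments (which registers the DeepSpeed config)
# BEFORE loading the model, so ZeRO-3 can partition the weights during loading.
from transformers import AutoModelForCausalLM, TrainingArguments

training_args = TrainingArguments(
    output_dir="llama_reitsport_finetuned",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="test_ds_config.json",  # the config shown above
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    torch_dtype="auto",
)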


My Python script:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import logging

logging.basicConfig(level=logging.INFO)

from datasets import load_dataset

# Load the JSONL file as a dataset

dataset = load_dataset("json", data_files="reitsport_lexikon.jsonl")

# Show one example

print(dataset['train'][0])

from transformers import AutoTokenizer

# Load the tokenizer for the desired model

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    tokens = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=True)

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import deepspeed

import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

print(f"Belegter GPU-Speicher: {torch.cuda.memory_allocated() / 10243:.2f} GiB")
print(f"Reservierter GPU-Speicher: {torch.cuda.memory_reserved() / 1024
3:.2f} GiB")

# Configuration for 8-bit quantization

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True  # enable 8-bit quantization
)
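
# (Side note to myself, not tried yet: BitsAndBytesConfig also has an
# llm_int8_enable_fp32_cpu_offload flag for keeping some modules on the CPU;
# I am not sure whether that is the right knob for this setup.)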

# Load the model

#model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-70B-Instruct", torch_dtype="auto")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb_config
)

# LoRA configuration

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)

model = get_peft_model(model, lora_config)

# Training arguments with the DeepSpeed configuration

training_args = TrainingArguments(
    output_dir="llama_reitsport_finetuned",
    gradient_accumulation_steps=8,  # increase gradient accumulation
    per_device_train_batch_size=1,
    num_train_epochs=20,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    deepspeed="test_ds_config.json",  # hook in the DeepSpeed config file
)

# Create the eval dataset (analogous to the train dataset)

small_train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(len(tokenized_dataset["train"])))

small_eval_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(len(tokenized_dataset["train"])))

# Create the Trainer with an eval_dataset

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

print(f"Belegter GPU-Speicher: {torch.cuda.memory_allocated() / 10243:.2f} GiB")
print(f"Reservierter GPU-Speicher: {torch.cuda.memory_reserved() / 1024
3:.2f} GiB")

import gc
gc.collect()

# Start the training

trainer.train()

print(f"Belegter GPU-Speicher: {torch.cuda.memory_allocated() / 10243:.2f} GiB")
print(f"Reservierter GPU-Speicher: {torch.cuda.memory_reserved() / 1024
3:.2f} GiB")
