Inquiry Regarding Out of Memory Issue During LoRA Fine-Tuning

I am a student currently working on training the LLAMA-4-Scout-17B-16E-Instruct model using LoRA, running on an H100 GPU with 80GB VRAM (on Lambda Labs). However, I have encountered an out of memory error during the training process. I understand that this might fall slightly outside the scope of the course, but despite extensive research and reviewing various community discussions, I have not been able to resolve the issue.

Here is a brief outline of my setup:

Hardware: H100 (80GB VRAM)

Model: LLAMA-4-Scout-17B-16E-Instruct (downloaded from the unsloth organization on Hugging Face)

Training Method: LoRA

Error: CUDA out of memory

Code snippet:
import os

import torch
from transformers import (
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
# imported but not used anywhere below
from accelerate import Accelerator, dispatch_model
from accelerate.utils import get_balanced_memory, infer_auto_device_map

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_path = "/home/ubuntu/llama4"
dataset_path = "llama_nc_instruction_train.jsonl"
output_dir = "./merged_llama4_nccode"

print(":brain: loading tokenizer…")
tokenizer = AutoTokenizer.from_pretrained(model_path)

print(":package: loading model… (using safetensors)")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

print(":wrench: applying LoRA settings…")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,  # some people use 8
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)

print(":page_facing_up: loading data…")
dataset = load_dataset("json", data_files=dataset_path, split="train")

def tokenize(example):
    tokenized_inputs = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=4196,
    )
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

print(":bullseye: setting up Trainer…")
training_args = TrainingArguments(
    output_dir="./lora_tmp",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # some people use 64
    gradient_accumulation_steps=512,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

print(":rocket: training…")
trainer.train()

print(":floppy_disk: merging LoRA weights…")
model = model.merge_and_unload()

print(":package: saving model to:", output_dir)
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(":white_check_mark: finished!")

and this is the error:

:brain: loading tokenizer…
:package: loading model… (using safetensors)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 50/50 [00:00<00:00, 457.56it/s]
:wrench: applying LoRA settings…
:page_facing_up: loading data…
:bullseye: setting up Trainer…
/home/ubuntu/CNC代碼定義訓練黨TEST.py:68: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(
Traceback (most recent call last):
  File "/home/ubuntu/CNC代碼定義訓練黨TEST.py", line 68, in <module>
    trainer = Trainer(
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/trainer.py", line 614, in __init__
    self._move_model_to_device(model, args.device)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/trainer.py", line 901, in _move_model_to_device
    model = model.to(device)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 79.19 GiB of which 359.06 MiB is free. Including non-PyTorch memory, this process has 78.83 GiB memory in use. Of the allocated memory 78.38 GiB is allocated by PyTorch, and 8.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.7 documentation)

Would anyone kindly offer suggestions or best practices to address this issue? Are there specific parameters I should consider adjusting (e.g., batch size, gradient checkpointing, LoRA rank) to make it fit within the memory constraints?
Or is this simply a hardware limitation, where even 80GB of VRAM is not enough for this model? I have also tried QLoRA and ran into the same error.


It looks like you’re running into a CUDA out-of-memory error while fine-tuning LLAMA-4-Scout-17B-16E-Instruct with LoRA on an H100 GPU with 80GB VRAM. Two things stand out: the traceback fails inside Trainer.__init__, when the full bf16 model is moved onto the GPU, so memory runs out before the first training step; and Scout is a mixture-of-experts model whose total parameter count is far larger than its 17B active parameters, so even 80GB can be too small for the unquantized weights, let alone activations and optimizer state.

Possible Causes

  1. Model Weights Alone – the whole bf16 checkpoint (all experts included) is moved onto a single GPU while the Trainer is being set up, which is exactly the allocation that fails in your traceback; with per_device_train_batch_size=1, batch size and gradient accumulation are not the main problem. A quick sanity check of the weight footprint is sketched right after this list.
  2. LoRA Doesn't Shrink the Base Model – r=8 on q_proj and v_proj adds only a small number of trainable parameters; the frozen base weights still have to fit in VRAM.
  3. Token Length – padding every sample to max_length=4196 produces large activations once training starts, so it matters for every step after loading.
  4. Memory Fragmentation – PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True reduces fragmentation, but it cannot help when the weights themselves do not fit.
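
To see whether the weights alone are the problem, here is a minimal sketch, assuming the model object from your script is in scope (nothing here is specific to your setup; the prints are only illustrative):

import torch

# Rough footprint of the frozen base weights: 2 bytes per bf16 parameter.
# This counts every parameter, including all MoE experts.
total_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {total_params / 1e9:.1f}B")
print(f"approx. bf16 weight size: {total_params * 2 / 1024**3:.1f} GiB")

# What the CUDA allocator currently holds on GPU 0
# (only meaningful once something has actually been moved to the GPU).
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.1f} GiB")

If the estimated weight size already exceeds roughly 80 GiB, no training-time setting will make the unquantized model fit on a single card.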

Potential Fixes

1. Reduce Gradient Accumulation Steps

Try lowering gradient_accumulation_steps to 128 or 64 instead of 512:

training_args = TrainingArguments(
    output_dir="./lora_tmp",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,  # Reduce from 512
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
)

Gradient accumulation itself adds little memory per step, so don’t expect large savings from this alone; it is still worth lowering, since 512 accumulation steps with a batch size of 1 means very few optimizer updates per epoch.

2. Lower Token Length

Try reducing max_length from 4196 to 2048:

tokenized_inputs = tokenizer(
    example["text"],
    truncation=True,
    padding="max_length",
    max_length=2048  # Reduce from 4196
)

This roughly halves the activation memory per sample.
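
A related change worth trying (my suggestion, not something from your script): drop padding="max_length" and let DataCollatorForLanguageModeling pad each batch to its longest sequence, so short samples are no longer padded out to thousands of tokens:

def tokenize(example):
    # Truncate only; DataCollatorForLanguageModeling pads dynamically per batch.
    # Assumes tokenizer.pad_token is set (e.g. tokenizer.pad_token = tokenizer.eos_token).
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=2048,
    )

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])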

3. Enable Gradient Checkpointing

This helps reduce memory usage by recomputing activations instead of storing them:

model.gradient_checkpointing_enable()
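
With Trainer plus a PEFT-wrapped model, it is usually more reliable to turn this on through TrainingArguments and to make sure the inputs to the frozen base still require grads. A sketch, assuming a reasonably recent transformers version:

training_args = TrainingArguments(
    output_dir="./lora_tmp",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

# Without this, checkpointing a model whose base weights are frozen can
# raise "does not require grad" errors during the backward pass.
model.enable_input_require_grads()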

4. Use torch.compile() for Optimization

If you’re using PyTorch 2.0+, you can also try compiling the model; this mainly improves speed, and any memory savings are usually modest, but it is cheap to test:

model = torch.compile(model)

5. Offload Model to CPU

If memory is still an issue, offload parts of the model to CPU using accelerate:

from accelerate import infer_auto_device_map, dispatch_model

# max_memory keys are device indices (0 for the first GPU) plus "cpu".
device_map = infer_auto_device_map(model, max_memory={0: "75GiB", "cpu": "20GiB"})
model = dispatch_model(model, device_map=device_map)

This keeps as much of the model on the GPU as fits within that budget and offloads the rest to CPU; expect training to be noticeably slower whenever layers are offloaded.
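
Since the OOM actually happens while the model is being placed on the GPU, an alternative I would try (a sketch, not taken from your script) is to shard and offload at load time via device_map and max_memory instead of dispatching afterwards:

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",                          # let accelerate place the layers
    max_memory={0: "75GiB", "cpu": "200GiB"},   # adjust to your GPU/RAM budget
)

When the model is loaded with a device map like this, Trainer should skip the model.to(device) call that is currently failing, though CPU-offloaded layers will slow training down considerably.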

Next Steps

Try these adjustments one by one and monitor memory usage. You mentioned QLoRA hit the same error; in that case it is worth double-checking that the 4-bit load path actually takes effect before anything is moved to the GPU, since 4-bit quantization plus a device map avoids the full bf16 copy that is failing here. A sketch follows below.
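
For completeness, here is what a QLoRA-style load usually looks like; a sketch assuming bitsandbytes is installed and reusing model_path and lora_config from your script (the config values below are the common NF4 defaults, not something specific to your run):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Casts layer norms, enables input grads, etc. so the quantized model can be trained.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)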

Let me know if you need help implementing these fixes! :rocket:
