I am a student currently fine-tuning the LLAMA-4-Scout-17B-16E-Instruct model with LoRA on an H100 GPU with 80GB of VRAM (on Lambda Labs). However, I keep hitting a CUDA out-of-memory error during training. I understand this may fall slightly outside the scope of the course, but despite extensive research and reviewing various community discussions, I have not been able to resolve the issue.
Here is a brief outline of my setup:
Hardware: H100 (80GB VRAM)
Model: LLAMA-4-Scout-17B-16E-Instruct (downloaded from the unsloth repository on Hugging Face)
Training Method: LoRA
Error: CUDA out of memory
Code snippet:
import os
import torch
from transformers import (
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    AutoModelForCausalLM,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from accelerate import dispatch_model, Accelerator
from accelerate.utils import get_balanced_memory, infer_auto_device_map

# Reduce fragmentation, as suggested in the OOM message
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_path = "/home/ubuntu/llama4"
dataset_path = "llama_nc_instruction_train.jsonl"
output_dir = "./merged_llama4_nccode"

print("loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("loading model... (using safetensors)")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

print("applying LoRA settings...")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,  # some people use 8
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)

print("loading data...")
dataset = load_dataset("json", data_files=dataset_path, split="train")

def tokenize(example):
    tokenized_inputs = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=4196,
    )
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

print("setting up Trainer...")
training_args = TrainingArguments(
    output_dir="./lora_tmp",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # some people use 64
    gradient_accumulation_steps=512,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

print("training...")
trainer.train()

print("merging LoRA weights...")
model = model.merge_and_unload()

print("saving model to:", output_dir)
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print("finished!")
And this is the error output:
loading tokenizer...
loading model... (using safetensors)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 50/50 [00:00<00:00, 457.56it/s]
applying LoRA settings...
loading data...
setting up Trainer...
/home/ubuntu/CNC代碼定義訓練黨TEST.py:68: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Trainer.__init__. Use processing_class instead.
  trainer = Trainer(
Traceback (most recent call last):
  File "/home/ubuntu/CNC代碼定義訓練黨TEST.py", line 68, in <module>
    trainer = Trainer(
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/trainer.py", line 614, in __init__
    self._move_model_to_device(model, args.device)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/trainer.py", line 901, in _move_model_to_device
    model = model.to(device)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 79.19 GiB of which 359.06 MiB is free. Including non-PyTorch memory, this process has 78.83 GiB memory in use. Of the allocated memory 78.38 GiB is allocated by PyTorch, and 8.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.7 documentation)
Would anyone kindly offer suggestions or best practices for addressing this issue? Are there specific parameters I should consider adjusting (e.g., batch size, gradient checkpointing, LoRA rank) to make training fit within the memory constraints?
Or is this simply a hardware limitation, and even 80GB of VRAM is not enough for this model? I have also tried QLoRA and ran into the same out-of-memory error.
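For reference, my QLoRA attempt went roughly in the direction sketched below. This is a minimal sketch, assuming bitsandbytes 4-bit (NF4) quantization, gradient checkpointing, and a paged 8-bit optimizer; the specific values (e.g., gradient_accumulation_steps=16) are illustrative and may not match my original script exactly:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

model_path = "/home/ubuntu/llama4"  # same local checkpoint as above

# 4-bit NF4 quantization (QLoRA-style loading)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers automatically
    trust_remote_code=True,
)
model.config.use_cache = False  # required when gradient checkpointing is on
# also enables gradient checkpointing by default
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./lora_tmp",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # illustrative value
    gradient_checkpointing=True,      # trade compute for activation memory
    bf16=True,
    optim="paged_adamw_8bit",         # paged optimizer states to relieve memory pressure
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
)

If this is along the right lines, I would appreciate pointers on which of these knobs matter most here.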