CUDA Runtime Error in the Middle of Training

Hi All,

I am trying to do LoRA fine-tuning on Gemma with the SFTTrainer in the Kaggle Notebooks environment with a P100 accelerator. The model trains fine for roughly 2.5 of the 3 epochs, but then, out of nowhere, it crashes with the following error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I have no idea why this occurs, or why it only happens after 2.54 epochs and not at any earlier point. I would appreciate it if someone could help me find and fix the cause of this problem.

For reference, here is my training set-up:

peft_config = LoraConfig(
    lora_alpha=256,
    lora_dropout=0.10,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
)

training_arguments = TrainingArguments(
    output_dir="logs",
    save_steps=15,
    save_total_limit=2,
    logging_steps=25,
    max_steps=-1,
    load_best_model_at_end=True,
    evaluation_strategy='steps',
    eval_steps=15,
    eval_accumulation_steps=1,
    num_train_epochs=3,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    learning_rate=3e-5,
    weight_decay=0.01,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=False,
    lr_scheduler_type="cosine",
    report_to="none"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_arguments,
    packing=False,
)

# Train model
trainer.train()

Edit: Here is what the notebook output looks like

You should check the code for errors, make sure the GPU has enough memory for the operations you are running, and try changing training parameters such as the batch size or the number of training steps. Additionally, you can set the CUDA_LAUNCH_BLOCKING=1 environment variable to get more precise CUDA error output, which may help pinpoint the cause of the problem.
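
Note that CUDA_LAUNCH_BLOCKING should be set before torch initializes CUDA, so on Kaggle it belongs in the very first cell of the notebook, before any torch or transformers imports. Here is a minimal sketch; the log_gpu_memory helper is only an illustration for watching memory usage around the step where the crash happens, not part of your original notebook:

import os
# Make kernel launches synchronous so the stack trace points at the op that actually failed
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Illustrative helper: print how much of the P100's 16 GB is currently in use.
# Call it before trainer.train() (or between cells) to see whether memory
# climbs toward the limit as training approaches the crash point.
def log_gpu_memory(tag=""):
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

log_gpu_memory("before training")

With launches made synchronous, training runs slower, but the error should be reported at the call that caused it instead of at some later API call, which makes the stack trace much more useful.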