CUDA Runtime Error in the Middle of Training

Hi All,

I am trying to do LoRA fine-tuning on Gemma with the SFTTrainer in the Kaggle Notebooks environment with a P100 accelerator. The model trains fine for roughly 2.5 of the 3 epochs, but then, out of nowhere, it crashes with the following error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I have no idea why this occurs, or why it only happens after 2.54 epochs and not at any earlier point. I would appreciate it if someone could help me find and fix the cause of this problem.

For reference, here is my training set-up:

peft_config = LoraConfig(
    lora_alpha=256,
    lora_dropout=0.10,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
)

training_arguments = TrainingArguments(
    output_dir="logs",
    save_steps=15,
    save_total_limit=2,
    logging_steps=25,
    max_steps=-1,
    load_best_model_at_end=True,
    evaluation_strategy='steps',
    eval_steps=15,
    eval_accumulation_steps=1,
    num_train_epochs=3,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    learning_rate=3e-5,
    weight_decay=0.01,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=False,
    lr_scheduler_type="cosine",
    report_to="none"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_arguments,
    packing=False,
)

# Train model
trainer.train()

Edit: Here is what the notebook output looks like

You should check the code for errors, make sure the GPU has enough memory for the operations you are running, and try changing training parameters such as the batch size or the number of training steps. Additionally, you can set the CUDA_LAUNCH_BLOCKING=1 environment variable to get more precise CUDA error output, which may help pinpoint the cause of the problem.
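
Note that CUDA_LAUNCH_BLOCKING should be set before torch initializes CUDA, so on Kaggle it belongs in the very first cell of the notebook, before any torch or transformers imports. Here is a minimal sketch; the log_gpu_memory helper is only an illustration for watching memory usage around the step where the crash happens, not part of your original notebook:

import os
# Make kernel launches synchronous so the stack trace points at the op that actually failed
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Illustrative helper: print how much of the P100's 16 GB is currently in use.
# Call it before trainer.train() (or between cells) to see whether memory
# climbs toward the limit as training approaches the crash point.
def log_gpu_memory(tag=""):
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

log_gpu_memory("before training")

With launches made synchronous, training runs slower, but the error should be reported at the call that caused it instead of at some later API call, which makes the stack trace much more useful.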