Hi All,
I am trying to do LoRA fine-tuning on Gemma with the SFTTrainer in the Kaggle Notebooks environment with a P100 accelerator. The model trains fine for roughly 2.5 of the 3 epochs, but then it crashes out of nowhere with the following error:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
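If it helps anyone reproduce this: my understanding is that the CUDA_LAUNCH_BLOCKING flag the error suggests only takes effect if it is set before CUDA is initialized, so it would have to go in the very first notebook cell, something like this:

import os

# Must run before any CUDA work (i.e., before loading the model), so that
# kernel launches are synchronous and the stack trace points at the actual
# failing call instead of a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"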
I have no idea why this occurs, or why it only occurs after 2.54 epochs rather than any earlier. I would appreciate it if someone could help me find and fix the cause of this problem.
For reference, here is my training set-up:
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

peft_config = LoraConfig(
    lora_alpha=256,
    lora_dropout=0.10,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_arguments = TrainingArguments(
    output_dir="logs",
    save_steps=15,
    save_total_limit=2,
    logging_steps=25,
    max_steps=-1,
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    eval_steps=15,
    eval_accumulation_steps=1,
    num_train_epochs=3,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    learning_rate=3e-5,
    weight_decay=0.01,
    fp16=True,   # P100 has no bf16 support, so fp16 mixed precision
    bf16=False,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=False,
    lr_scheduler_type="cosine",
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_arguments,
    packing=False,
)
# Train model
trainer.train()
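In case it matters: since save_steps=15 keeps recent checkpoints in output_dir, restarting from the last one after a crash should be possible along these lines (a workaround to recover progress, not a fix):

# Resume from the most recent checkpoint found in output_dir ("logs" above);
# training continues from the saved model/optimizer/scheduler state.
trainer.train(resume_from_checkpoint=True)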
Edit: Here is what the notebook output looks like: