I am using `Seq2SeqTrainer` to LoRA-finetune the MADLAD 3B model on my own dataset. When I run `trainer.evaluate()`, the evaluation loss is not reported.
Code:
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# peft_model, model, tokenizer, train_dataset and test_dataset are defined earlier
output_dir = "madlad_run_5"

training_args = Seq2SeqTrainingArguments(
    output_dir,
    per_device_train_batch_size=32,
    # per_device_eval_batch_size=2,
    num_train_epochs=1,
    # max_steps=1000,
    # gradient_accumulation_steps=8,
    bf16=True,
    torch_compile=True,
    # gradient_checkpointing=True,
    # torch_empty_cache_steps=10,
    learning_rate=3e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.00001,
    warmup_ratio=0.1,
    report_to="tensorboard",
    logging_dir=f"{output_dir}/tensorboard_logs",
    logging_first_step=True,
    logging_steps=1,
    save_strategy="epoch",
    # save_steps=1000,
    save_total_limit=1,
    eval_strategy="steps",
    eval_steps=5,
    include_for_metrics=["loss"],
    # batch_eval_metrics=True,
    optim="paged_adamw_32bit",
    dataloader_pin_memory=True,  # optimization
    dataloader_num_workers=4,
    # eval_on_start=True,
)

trainer = Seq2SeqTrainer(
    peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
    # compute_metrics=compute_metrics,
    # callbacks=[eval_loss_callback]  # add the custom callback
)

eval_results = trainer.evaluate()
print(eval_results)

# trainer.train()
# peft_model.save_pretrained("madlad_run_2_lora_ckpt")
Output:
{'eval_runtime': 10.9764, 'eval_samples_per_second': 9.11, 'eval_steps_per_second': 0.456}
I want to get the evaluation loss as well, i.e. an `eval_loss` entry in this dictionary.
Some extra context:
I am using TensorBoard for logging, and in the initial training runs the evaluation loss was also missing from those logs. I tried a few fixes, such as adding a custom callback and a `compute_metrics` function (sketched below), but the evaluation loss still did not show up.
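For reference, the callback and `compute_metrics` attempts were roughly along the following lines. This is a simplified sketch rather than my exact code; the class name and the `mean_loss` key are just illustrative, and the `losses` attribute on `EvalPrediction` is what I understood `include_for_metrics=["loss"]` to populate.

import numpy as np
from transformers import TrainerCallback

# Custom callback attempt: print whatever metrics dict the Trainer hands to
# on_evaluate after each evaluation run.
class EvalLossCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        print(f"Eval metrics at step {state.global_step}: {metrics}")

eval_loss_callback = EvalLossCallback()

# compute_metrics attempt: with include_for_metrics=["loss"], the
# EvalPrediction should carry per-batch losses, so average them into a metric.
def compute_metrics(eval_pred):
    losses = getattr(eval_pred, "losses", None)
    if losses is None:
        return {}
    return {"mean_loss": float(np.mean(losses))}

Both of these are commented out in the trainer setup above because neither of them made the evaluation loss appear.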
Then I tried running `trainer.evaluate()` directly, as in the code above, but the evaluation loss was missing there as well.
How can I get the evaluation loss both in the evaluation logs during training and in the results of `trainer.evaluate()`?
System configuration:
- `transformers` version: 4.48.3
- Platform: Linux-5.15.0-127-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.5.2
- Accelerate version: 1.3.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: no
  - use_cpu: False
  - debug: False
  - num_processes: 3
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: all
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: NVIDIA A40
Let me know if any other information is required. Any help in resolving this would be appreciated.