Hi, I built from source yesterday but I still don’t think I’m seeing the expected behavior when it comes to logging.
With gradient_accumulation_steps=16, logging_steps=100 and eval_steps=100, I expect both the loss and the validation metrics to be printed at step 100, but nothing is printed at all.
With gradient_accumulation_steps=1, logging_steps=100 and eval_steps=100, only the loss and learning rate (no eval metrics) are printed once at step 100, and then CUDA runs out of memory at step 200. (With the previous config, gradient_accumulation_steps=16, the memory crash doesn't happen.)
In addition, I can't find any of the train or eval metrics in the TensorBoard event files. Does anything need to be done explicitly to log to TensorBoard?
I've included my compute_metrics function below, as well as where I instantiate the training arguments and the trainer.
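For what it's worth, this is roughly how I'm checking the event files, using TensorBoard's EventAccumulator (the run directory and tag names below are just examples of what I'd expect to see):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point at the run directory that contains the events.out.tfevents.* file
ea = EventAccumulator("./runs/my_run")  # example path
ea.Reload()

print(ea.Tags()["scalars"])           # expecting tags like "loss", "eval_rouge2_fmeasure"
for event in ea.Scalars("loss"):      # raises KeyError if the tag was never written
    print(event.step, event.value)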
Thank you!
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
training_args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluate_during_training=True,
    do_train=True,
    do_eval=True,
    # fp16=True,  # This has a known bug with t5
    gradient_accumulation_steps=16,
    logging_steps=100,
    eval_steps=100,
    overwrite_output_dir=True,
    save_total_limit=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)