Trainer doesn't show the loss at each step

Hi, I built transformers from source yesterday, but I'm still not seeing the expected logging behavior.
With gradient_accumulation_steps=16, logging_steps=100, and eval_steps=100, I expect both the loss and the validation metrics to be printed at step 100, but nothing is printed at step 100.
With gradient_accumulation_steps=1, logging_steps=100, and eval_steps=100, only the loss and learning rate (no eval metrics) are printed once at step 100, and then at step 200 CUDA runs out of memory. (With the previous config, gradient_accumulation_steps=16, the memory crash doesn't happen.)
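For my own bookkeeping, I wrote down the arithmetic. If I understand correctly, logging_steps counts optimizer (global) steps, so each logged step covers gradient_accumulation_steps micro-batches:

```python
# logging_steps counts optimizer steps, and each optimizer step consumes
# gradient_accumulation_steps micro-batches. So the first log event happens
# much later in terms of data seen when accumulation is on.
def micro_batches_before_first_log(logging_steps, gradient_accumulation_steps):
    return logging_steps * gradient_accumulation_steps

# With accumulation=16, global step 100 means 1600 micro-batches processed;
# with accumulation=1 it is only 100 micro-batches.
print(micro_batches_before_first_log(100, 16))  # 1600
print(micro_batches_before_first_log(100, 1))   # 100
```

So if nothing appears "at iteration 100", it may just be that step 100 with accumulation hadn't been reached yet, which is part of what I'd like to confirm.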
In addition, I can't find any of the train or eval metrics in the tf event files. Does anything need to be done explicitly to log to TensorBoard?
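In case it matters, this is roughly how I understand the TensorBoard setup: event files should be written as long as the tensorboard package is installed and logging_dir points somewhere sensible (the ./runs path here is just an example, not something from my actual run):

```python
from transformers import TrainingArguments

# Sketch of my understanding, not a verified fix: with tensorboard installed,
# the Trainer should write tf event files under logging_dir.
training_args = TrainingArguments(
    output_dir="./",
    logging_dir="./runs",  # event files should land here
    logging_steps=100,
)
```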
I've added my compute_metrics function below, along with where I instantiate the training arguments and the Trainer.

Thank you!

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Replace -100 (positions ignored by the loss) with the pad token id so
    # batch_decode doesn't fail on them, then strip all special tokens.
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

training_args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluate_during_training=True,
    do_train=True,
    do_eval=True,
    # fp16=True,  # This has a known bug with t5
    gradient_accumulation_steps=16,
    logging_steps=100,
    eval_steps=100,
    overwrite_output_dir=True,
    save_total_limit=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)