Trainer doesn't show the loss at each step

Hi,

Is there a way to display/print the loss (or metrics if you are evaluating) at each step (or every n steps), or every time you log? I don’t see any option for that. This is very important because it is the only way to tell whether the model is learning. I thought “debug” was going to work, but it seems to be deprecated. I am training in a Jupyter notebook, by the way.

Also, what about an early stopping option?


The loss and metrics are printed every logging_steps (a bug there was fixed recently, so you might need to update to an install from source). As for early stopping, there is a PR under review that adds it, so it should come soon.
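
Until that lands, early stopping can be approximated by hand. Below is a minimal sketch of the usual patience-based rule in plain Python. The class name EarlyStopper and its parameters are made up for illustration; they are not part of the Trainer API, and the PR may well implement it differently.

```python
class EarlyStopper:
    """Signals a stop once the eval loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, eval_loss: float) -> bool:
        if eval_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical eval losses, one per evaluation round.
stopper = EarlyStopper(patience=2)
for loss in [0.9, 0.7, 0.71, 0.72, 0.5]:
    if stopper.should_stop(loss):
        print("stopping at eval loss", loss)
        break
```

In practice you would call should_stop with the eval loss after each evaluation and break out of your own training loop when it returns True.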

mmm, will try updating then. Thank you!

Hi, I built from source yesterday but I still don’t think I’m seeing the expected behavior when it comes to logging.
With gradient_accumulation_steps=16, logging_steps=100 and eval_steps=100, I expect to see both the loss and validation metrics printed at iteration 100 but nothing is printed at step 100.
With gradient_accumulation_steps=1, logging_steps=100 and eval_steps=100, only the loss and learning rate (no eval metrics) are printed once at step 100, and then at step 200 CUDA runs out of memory. (With the previous config, gradient_accumulation_steps=16, logging_steps=100 and eval_steps=100, the memory crash doesn’t happen.)
In addition, I can’t seem to find any of the train or eval metrics in the tf event files. Is there anything that needs to be done explicitly to log info to TensorBoard?
I added my compute metrics function below as well as where I’m instantiating the training arguments and trainer.

Thank you!

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

training_args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluate_during_training=True,
    do_train=True,
    do_eval=True,
    # fp16=True,  # This has a known bug with t5
    gradient_accumulation_steps=16,
    logging_steps=100,
    eval_steps=100,
    overwrite_output_dir=True,
    save_total_limit=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)

So the explanation of the first point is that right now, your eval_steps and logging steps have to be a round multiple of your gradient accumulation steps since those are tested only when you actually do an update (might fix that in the future but I have to think of it more). Since 100 is not a round multiple of 16, you can’t see anything.

In the second test, the metrics should be logged at step 100, so this one is weird. If you don’t see them printed, it’s logical you don’t see them in tensorboard either, it just means they were never logged.

Hi, thanks for your reply!
Coming back to 100 not being a round multiple of 16: I changed the gradient accumulation steps to 32 and the logging steps to 128. I would then expect the logging to happen at the 128th step, but nothing is printed. Any insight?

Hmmm I want to add that although save_steps is 512 there has been nothing written to the specified checkpoint output dir. (Now it is on step ~2000, and still nothing printed for logging either).
Showing my full training args below.

Any insight would be greatly appreciated. I’m really scratching my head over the logging and saving issue.

batch_size = 1

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=batch_size,
    do_train=True,
    # fp16=True,  # This has a known bug with t5
    gradient_accumulation_steps=32,
    logging_steps=128,
    save_steps=512,
    overwrite_output_dir=True,
    save_total_limit=10,
)

optimizer = Adafactor(model.parameters(), lr=1e-3, relative_step=False, warmup_init=False)
scheduler = get_constant_schedule(optimizer)
optimizers = optimizer, scheduler

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=optimizers
)

trainer.train()

Ah, sorry, I misread the code. When doing gradient accumulation, one step is one optimizer update, so it happens only once every gradient_accumulation_steps batches. So you need to lower your logging_steps/eval_steps/save_steps to see something happening.
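
To make the arithmetic concrete, here is a small sketch in plain Python (not Trainer code; the helper name batches_until_first_event is made up for illustration). It assumes that logging_steps, eval_steps and save_steps count optimizer updates, and that each update consumes gradient_accumulation_steps batches.

```python
def batches_until_first_event(every_n_steps: int, gradient_accumulation_steps: int) -> int:
    """Number of training batches processed before the first log/eval/save fires.

    Assumes events are counted in optimizer updates, each of which
    consumes `gradient_accumulation_steps` batches.
    """
    if every_n_steps <= 0 or gradient_accumulation_steps <= 0:
        raise ValueError("both arguments must be positive")
    return every_n_steps * gradient_accumulation_steps


# Settings from the post above (gradient_accumulation_steps=32):
print(batches_until_first_event(128, 32))  # logging_steps=128 -> 4096 batches
print(batches_until_first_event(512, 32))  # save_steps=512   -> 16384 batches
```

Under that assumption, a run that has only reached batch ~2000 has not yet hit its first log (4096) or its first checkpoint (16384), which matches what was observed.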

Thanks, that makes sense :slight_smile:

I made a PR to add a warning about this in the documentation, so that other users are not surprised :slight_smile:

I have the same problem. I don’t see the loss reported. Using version 3.0.2. Here is my code:
training_args = TrainingArguments(
    output_dir="./ClaimsBERTo_WordLevel",
    overwrite_output_dir=False,
    num_train_epochs=3,
    per_gpu_train_batch_size=64,
    save_steps=10,
    logging_steps=10,
    eval_steps=10,
    save_total_limit=5,
    evaluate_during_training=True,
    do_eval=True,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    prediction_loss_only=True,
)

I hope someone will be able to help.
Thank you

I am not sure it was done correctly, but once I added the following to trainer.py I could see the losses printed:

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')
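
Editing trainer.py should not be necessary: logging.basicConfig configures the root logger, so calling it from your own script or notebook before training should have the same effect. Here is a minimal sketch, assuming the Trainer emits its messages through Python's standard logging module. The logger name "transformers.trainer" is an assumption based on the module path, not something confirmed in this thread.

```python
import logging

# Configure the root logger once, before creating the Trainer.
# force=True (Python 3.8+) replaces handlers a notebook may have installed already.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(message)s",
    force=True,
)

# Assumed logger name; child loggers inherit the root level unless one is set explicitly.
trainer_logger = logging.getLogger("transformers.trainer")
trainer_logger.info("logging is configured")
```

DEBUG is kept here because that is the level reported to work above; if the output is too noisy, a higher level such as INFO may be worth trying.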