Trainer doesn't show the loss at each step

pyantony · July 30, 2020, 9:34pm

Hi,

is there a way to display/print the loss (or metrics if you are evaluating) at each step (or every n steps) or every time you log? I don’t see any option for that. This is very important because it is the only way to tell whether the model is learning or not. I thought “debug” was going to work but it seems to be deprecated. I am training in a Jupyter notebook, by the way.

Also, what about an early stopping option?

sgugger · July 30, 2020, 9:59pm

The loss and metrics are printed every logging_steps (there was a bug that was recently fixed, so you might need to update to an install from source). As for early stopping, there is a PR under review for it, so it should come soon.

pyantony · July 30, 2020, 10:22pm

mmm, will try updating then. Thank you!

melody-ju · September 3, 2020, 7:28am

Hi, I built from source yesterday but I still don’t think I’m seeing the expected behavior when it comes to logging.
With gradient_accumulation_steps=16, logging_steps=100 and eval_steps=100, I expect to see both the loss and validation metrics printed at iteration 100 but nothing is printed at step 100.
With gradient_accumulation_steps=1, logging_steps=100 and eval_steps=100, only the loss and learning rate (no eval metrics) are printed once at step 100, and then at step 200 CUDA runs out of memory. (With the previous config, gradient_accumulation_steps=16, logging_steps=100 and eval_steps=100, the memory crash doesn’t happen.)
In addition I can’t seem to find any of the train or eval metrics in the tf event files. Is there anything that needs to be done explicitly to log info to tensorboard?
I added my compute metrics function below as well as where I’m instantiating the training arguments and trainer.

Thank you!

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

training_args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluate_during_training=True,
    do_train=True,
    do_eval=True,
    # fp16=True,  # This has a known bug with t5
    gradient_accumulation_steps=16,
    logging_steps=100,
    eval_steps=100,
    overwrite_output_dir=True,
    save_total_limit=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)

sgugger · September 3, 2020, 11:14am

So the explanation of the first point is that right now, your eval_steps and logging_steps have to be a round multiple of your gradient_accumulation_steps, since those are tested only when you actually do an update (I might fix that in the future but I have to think about it more). Since 100 is not a round multiple of 16, you can’t see anything.

In the second test, the metrics should be logged at step 100, so this one is weird. If you don’t see them printed, it’s logical you don’t see them in tensorboard either, it just means they were never logged.

melody-ju · September 3, 2020, 1:06pm

Hi, thanks for your reply!
Coming back to 100 not being a round multiple of 16: I changed the gradient accumulation steps to 32 and logging steps to 128. I would then expect the logging to happen at the 128th step, but nothing is printed. Any insight?

melody-ju · September 3, 2020, 2:25pm

Hmmm I want to add that although save_steps is 512 there has been nothing written to the specified checkpoint output dir. (Now it is on step ~2000, and still nothing printed for logging either).
Showing my full training args below.

Any insight would be greatly appreciated. I’m really scratching my head over the logging and saving issue.

batch_size = 1

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=batch_size,
    do_train=True,
    # fp16=True,  # This has a known bug with t5
    gradient_accumulation_steps=32,
    logging_steps=128,
    save_steps=512,
    overwrite_output_dir=True,
    save_total_limit=10,
)

optimizer = Adafactor(model.parameters(), lr=1e-3, relative_step=False, warmup_init=False)
scheduler = get_constant_schedule(optimizer)
optimizers = optimizer, scheduler

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=optimizers
)

trainer.train()

sgugger · September 3, 2020, 8:37pm

Ah, sorry I misread the code. When doing gradient accumulation, one step is one backward pass, so it happens every gradient_accumulation_steps examples. So you need to lower your logging_steps/eval_steps/save_steps to see something happening.
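The arithmetic behind that explanation can be sketched in a few lines. This is only an illustration under the assumption stated above (one counted step per gradient_accumulation_steps batches); `batches_per_event` is a hypothetical helper, not part of the Trainer API.

```python
def batches_per_event(interval_steps: int, gradient_accumulation_steps: int) -> int:
    """Batches consumed between two logging/eval/save events.

    Assumes one counted step corresponds to gradient_accumulation_steps
    batches, as described in the reply above.
    """
    if interval_steps <= 0 or gradient_accumulation_steps <= 0:
        raise ValueError("both arguments must be positive")
    return interval_steps * gradient_accumulation_steps


# Settings quoted in the thread.
first_log_batches = batches_per_event(128, 32)    # logging_steps=128
first_save_batches = batches_per_event(512, 32)   # save_steps=512
earlier_log_batches = batches_per_event(100, 16)  # logging_steps=100

print(first_log_batches, first_save_batches, earlier_log_batches)
```

If the "step ~2000" reported earlier counts batches (an assumption), it is still below both the first log and the first checkpoint, which would match seeing no output yet.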

melody-ju · September 4, 2020, 9:18am

Thanks, that makes sense

sgugger · September 4, 2020, 12:46pm

I made a PR to add a warning about this in the documentation, so that other users are not surprised

poli · September 30, 2020, 9:53pm

I have the same problem. I don’t see the loss reported. Using version 3.0.2. Here is my code:
training_args = TrainingArguments(
    output_dir="./ClaimsBERTo_WordLevel",
    overwrite_output_dir=False,
    num_train_epochs=3,
    per_gpu_train_batch_size=64,
    save_steps=10,
    logging_steps=10,
    eval_steps=10,
    save_total_limit=5,
    evaluate_during_training=True,
    do_eval=True,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    prediction_loss_only=True,
)

I hope someone will be able to help.
Thank you

poli · October 1, 2020, 4:20pm

I am not sure it was done correctly, but once I added the following to trainer.py I could see the losses printed:
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')
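The same effect can usually be had without editing the installed trainer.py, by configuring logging from the training script or notebook before training starts. A minimal sketch, assuming the Trainer version in use reports the loss through Python's standard logging module at INFO level; the helper name `configure_training_logs` and the logger name `transformers.trainer` are assumptions made for illustration.

```python
import logging


def configure_training_logs(level: int = logging.INFO) -> logging.Logger:
    """Route log records at `level` and above to the console."""
    # force=True (Python 3.8+) replaces handlers that a notebook or another
    # library may already have attached; without it, basicConfig does nothing
    # when the root logger already has handlers.
    logging.basicConfig(level=level, format="%(asctime)s %(message)s", force=True)
    # Assumed logger name: a module-level logger created in transformers/trainer.py.
    return logging.getLogger("transformers.trainer")


trainer_logger = configure_training_logs()
trainer_logger.info("logging configured")
```

INFO is used here rather than DEBUG to keep the output shorter; DEBUG, as in the post above, shows strictly more.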

brando · August 16, 2022, 4:38pm

@sgugger @poli I can’t see the train loss, which is what I want to see:

{'eval_loss': 2.9247915744781494, 'eval_runtime': 0.0168, 'eval_samples_per_second': 595.866, 'eval_steps_per_second': 59.587, 'epoch': 2822.0}                                                                                                                          
 28%|██████████████████████████████████████████████████████████████▉

SUNM · May 19, 2023, 3:52am

Hi @brando, how are you doing? Sorry, I want to see the training and validation loss. This is my code. I want to see the logs with TensorBoard. Do you have any idea how that is possible?

training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path,
    learning_rate=5e-5, num_train_epochs=15, warmup_steps=100, weight_decay=0.01,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    logging_steps=5000, save_total_limit=1, load_best_model_at_end=True,
    per_device_train_batch_size=2, per_device_eval_batch_size=2, seed=42,
)

Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset, eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
).train()

ivanzhouyq · July 7, 2023, 8:46pm

I encountered the same problem – I cannot see train loss or throughput. Only eval loss and samples per second are shown:

{'eval_loss': 8.6328125, 'eval_runtime': 26.1463, 'eval_samples_per_second': 17.134, 'eval_steps_per_second': 0.535, 'epoch': 0.71}

lokesh005 · October 4, 2023, 3:38pm

I think you just need to scroll through the nine panels available (only 6 out of 9 displayed on your screenshot) and you’ll find the actual training loss under train/loss and not train/train_loss

chen806 · October 11, 2023, 3:16am

I am not sure if it is a Jupyter or an HF Trainer issue. I tried a lot of combinations of eval_steps, evaluation_strategy and logging_steps, and they do not work as expected. Although I am setting logging_steps = 50, I am still seeing the loss every 10 steps, and eval_steps does not work either. Does anyone know why?

fulltrend · November 30, 2023, 8:37pm

This is coming late to the game, but in case someone else still struggles - you may need to provide a valid report_to param, such as:
report_to='tensorboard'
so that the trainer knows how to format the output.
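As a sketch of how that setting might sit alongside the other logging options: the keyword names below (report_to, logging_dir, logging_strategy, logging_steps) are TrainingArguments parameters in recent transformers releases, while the values and the name `logging_kwargs` are illustrative assumptions. TensorBoard also has to be installed for the integration to be used.

```python
# Logging-related keyword arguments, kept in a plain dict so they can be
# reused across experiments. All values here are illustrative assumptions.
logging_kwargs = dict(
    report_to="tensorboard",   # where the Trainer sends its logs
    logging_dir="./logs",      # directory for the TensorBoard event files
    logging_strategy="steps",  # log on a step interval rather than per epoch
    logging_steps=50,          # interval, counted in update steps
)

# Usage (requires transformers; left commented so the sketch runs without it):
# from transformers import TrainingArguments
# training_args = TrainingArguments(output_dir="./out", **logging_kwargs)

print(logging_kwargs)
```

The event files can then be inspected with `tensorboard --logdir ./logs`.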

kkruskal · January 3, 2024, 11:53am

I’ve found that running a Jupyter notebook inside PyCharm causes this problem; when I run it in a browser, everything works fine.

SCUER · January 24, 2024, 1:15pm

I have been using a notebook in PyCharm, so I will try that.
