Hello,
I am having difficulty getting my code to log metrics to wandb periodically, so that I can check that I am checkpointing correctly. Specifically, although I am running my model for 10 epochs (with 2 examples per epoch, for debugging) and am requesting logging every 2 steps, my wandb output displays only the very last metric for both train and eval: a single dot each. That value does correspond correctly to the epoch-10 output.
Could you please help me find the issue in my code/understanding?
I am adapting the Transformers run_mlm.py example script so that it saves validation checkpoints periodically. Specifically, I run my version of the script with:
python3 run_mlm.py --model_name_or_path bert-base-uncased --do_train --do_eval --output_dir ./models/development/Alex/with_tags --train_file ./finetune/child/Alex/train.txt --validation_file ./finetune/child/Alex/val.txt --max_train_samples 2 --max_eval_samples 2 --overwrite_output_dir
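For context, run_mlm.py turns these flags into argument objects with HfArgumentParser; the stock script does roughly this (the branch for reading arguments from a json file is omitted here):

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()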
After the arguments are parsed, I override the default TrainingArguments values in my copy of run_mlm.py as follows:
# Added these lines
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "eval_loss"
# end added
# 8/7/21 added
is_child = model_args.model_name_or_path != 'bert-base-uncased'
num_epochs = 10 if is_child else 10 # Debug mode only!!! Both branches are deliberately 10 while debugging; the non-child value must be reverted later.
# end add
# 8/1/21 added line
training_args.save_total_limit = 1
strategy = "steps"
training_args.logging_strategy = strategy
training_args.evaluation_strategy = strategy
training_args.save_strategy = strategy
# For the child scripts
logger.info('run_mlm.py is in debug mode and is overriding num_train_epochs for non-child runs! Need to revert!')
interval_steps = 2  # as noted above, I want to log/eval/save every 2 steps
training_args.save_steps = interval_steps
training_args.logging_steps = interval_steps
training_args.eval_steps = interval_steps
# end added
# For now train for fewer epochs because perplexity difference is not very large.
training_args.num_train_epochs = num_epochs
training_args.learning_rate = learning_rate  # learning_rate is defined earlier in my script
# end additions
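To sanity-check that these overrides actually take effect, I can log them right after the block above (a minimal debugging sketch I added; it is not in the stock script):

logger.info(
    f"logging_strategy={training_args.logging_strategy}, "
    f"evaluation_strategy={training_args.evaluation_strategy}, "
    f"save_strategy={training_args.save_strategy}, "
    f"logging_steps={training_args.logging_steps}, "
    f"eval_steps={training_args.eval_steps}, "
    f"save_steps={training_args.save_steps}"
)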