Wandb does not display train/eval loss except for the last one

Hello,

I am having difficulty getting my code to log metrics to wandb periodically, so that I can check that I am checkpointing correctly. Specifically, although I am running my model for 10 epochs (with 2 examples per epoch, for debugging) and am requesting logging every 2 steps, my wandb output displays only the very last metric for both train and eval: a single dot. That metric correctly matches the output for epoch 10.

Could you please help me find the issue in my code/understanding?

I am adapting the run_mlm.py example script so that it saves validation checkpoints periodically. Specifically, I invoke it as follows:

python3 run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --do_eval \
    --output_dir ./models/development/Alex/with_tags \
    --train_file ./finetune/child/Alex/train.txt \
    --validation_file ./finetune/child/Alex/val.txt \
    --max_train_samples 2 \
    --max_eval_samples 2 \
    --overwrite_output_dir

and, in my version of run_mlm.py, I override the default TrainingArguments values after argument parsing:

    # Added these lines
    training_args.load_best_model_at_end = True
    training_args.metric_for_best_model = "eval_loss"
    # end added

    # 8/7/21 added
    is_child = model_args.model_name_or_path != 'bert-base-uncased'
    num_epochs = 10 if is_child else 10  # Debug mode only!!!
    # end added

    # 8/1/21 added
    training_args.save_total_limit = 1
    strategy = "steps"
    training_args.logging_strategy = strategy
    training_args.evaluation_strategy = strategy
    training_args.save_strategy = strategy

    # For the child scripts
    logger.info('run_mlm.py is in debug mode and is requesting epoch = 20 for non-child! Need to revert!')

    # interval_steps is defined earlier in my script (not shown)
    training_args.save_steps = interval_steps
    training_args.logging_steps = interval_steps
    training_args.eval_steps = interval_steps
    # end added

    # For now, train for fewer epochs because the perplexity difference is not very large.
    # learning_rate is also defined earlier in my script (not shown)
    training_args.num_train_epochs = num_epochs
    training_args.learning_rate = learning_rate
    # end additions

Does wandb work any better with logging_steps=1? Also try adding training_args.report_to = "wandb", as it may be needed in future transformers releases.
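
For reference, a minimal sketch (not the original script) of setting this up front when constructing TrainingArguments, rather than mutating the object afterwards; the output_dir value is a hypothetical placeholder:

    # Minimal sketch: enable wandb reporting explicitly at construction time.
    # The output_dir path is a hypothetical placeholder.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./models/debug",
        logging_strategy="steps",
        logging_steps=1,        # log every step while debugging
        report_to="wandb",      # send metrics explicitly to wandb
    )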

Do the logs from huggingface that get printed in the console appear as expected, or are they also truncated?

I had the same issue and found that if you use the Trainer, you want to pass evaluation_strategy = 'steps'. This adds the additional logging to wandb.
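
For example, here is a minimal sketch of the relevant arguments (the values are placeholders): with evaluation_strategy="steps" and matching eval/logging intervals, the Trainer evaluates and logs eval_loss every few steps instead of only at the end:

    # Minimal sketch: evaluate and log every 2 steps so wandb receives
    # train and eval metrics throughout training. Values are placeholders.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./models/debug",
        evaluation_strategy="steps",
        eval_steps=2,
        logging_strategy="steps",
        logging_steps=2,
        save_strategy="steps",
        save_steps=2,
        report_to="wandb",
    )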