Trainer.evaluate()

Hello,

When the following code is run several times (notebook language_modeling.ipynb), it gives a different value each time:

import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

I do not understand why (the eval loss should always be the same when using the same eval dataset and the same model).

I don't think it's the exact same model and the same evaluation dataset every time. Can you actually save and check the eval dataset across two runs, please?
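For example, here is a minimal sketch of one way to do that check, assuming trainer.eval_dataset is a tokenized datasets.Dataset with an "input_ids" column (the helper name is hypothetical):

import hashlib

def eval_dataset_fingerprint(dataset):
    # Hash every example's token ids so two runs can be compared byte-for-byte.
    h = hashlib.sha256()
    for example in dataset:
        h.update(str(example["input_ids"]).encode("utf-8"))
    return h.hexdigest()

print(eval_dataset_fingerprint(trainer.eval_dataset))

If the printed hash is identical in both runs, the eval examples themselves are identical and the difference must come from somewhere else.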

Hello @infinitejoy.

How could the model not be the same?
When you run trainer.evaluate(), the model used is trainer.model.

Just run trainer.evaluate() twice in a row in the Colab notebook language_modeling.ipynb and you’ll see a different perplexity (… with the same model).

import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Same comment: when you run the same code trainer.evaluate() twice, it uses the same evaluation set, defined at the beginning of the notebook by the following code:

from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Could you run the Colab notebook and publish your results? Thanks. (I did it for Masked LM, not for Causal LM)

For masked LM, the tokens are randomly masked each time you go through the dataloader (for both training and evaluation) in this notebook. That is why you see slightly different results on each run.
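If you want the two calls to match, one option (a minimal sketch, assuming the notebook's DataCollatorForLanguageModeling setup) is to reseed the RNGs before each evaluation so the random masking is reproducible:

from transformers import set_seed
import math

# Reseed the Python/NumPy/PyTorch RNGs so the data collator masks the same tokens each time.
set_seed(42)
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

set_seed(42)
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

With the same seed, the masking pattern is identical, so the two perplexities should now agree.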
