Trainer.evaluate()

Hello,

When the following code is run several times (notebook language_modeling.ipynb), it gives a different value each time:

import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

I do not understand why (the eval loss should always be the same when using the same eval dataset and the same model).

I don't think it's the exact same model and the same evaluation dataset every time. Can you actually save and check the eval dataset across two runs, please?
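For example, here is a minimal sketch of one way to do that check, assuming trainer.eval_dataset is a tokenized datasets.Dataset with an "input_ids" column (the helper name is hypothetical):

import hashlib

def eval_dataset_fingerprint(dataset):
    # Hash every example's token ids so two runs can be compared byte-for-byte.
    h = hashlib.sha256()
    for example in dataset:
        h.update(str(example["input_ids"]).encode("utf-8"))
    return h.hexdigest()

print(eval_dataset_fingerprint(trainer.eval_dataset))

If the printed hash is identical in both runs, the eval examples themselves are identical and the difference must come from somewhere else.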

Hello @infinitejoy.

How could the model not be the same?
When you run trainer.evaluate(), the model used is trainer.model.

Just run trainer.evaluate() twice in a row in the Colab notebook language_modeling.ipynb and you’ll see a different perplexity (… with the same model).

import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Same comment: when you run the same code trainer.evaluate() twice, it uses the same evaluation set, defined at the beginning of the notebook by the following code:

from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Could you run the Colab notebook and publish your results? Thanks. (I did it for Masked LM, not for Causal LM)

For masked LM, the tokens are randomly masked each time you go through the dataloader (for both training and evaluation) in this notebook. That is why you see slightly different results on each run.
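If you want the two calls to match, one option (a minimal sketch, assuming the notebook's DataCollatorForLanguageModeling setup) is to reseed the RNGs before each evaluation so the random masking is reproducible:

from transformers import set_seed
import math

# Reseed the Python/NumPy/PyTorch RNGs so the data collator masks the same tokens each time.
set_seed(42)
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

set_seed(42)
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

With the same seed, the masking pattern is identical, so the two perplexities should now agree.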
