Perplexity Calculation in run_clm.py

I am trying to evaluate the perplexity of a model on WikiText-2.

The three code sources I am using are:

  1. yxli2123/LoftQ
  2. horseee/LLM-Pruner ([NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models; supports LLaMA, Llama-2, BLOOM, Vicuna, Baichuan, etc.)
  3. locuslab/wanda (a simple and effective LLM pruning approach)

Sources 2 and 3 agree with each other, but source 1, which is based on the run_clm.py script from transformers (examples/pytorch/language-modeling/run_clm.py), gives a significantly different result.

For example, when evaluating Llama-2 13B I get the following respective perplexities (using seq_length 1024):

  1. 12.02
  2. 5.43
  3. 5.43

Does anyone know why the value obtained from source 1 is so different from the other two?
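
For reference, this is roughly how I understand sources 2 and 3 to compute perplexity: the WikiText-2 test split is concatenated, tokenized once, split into non-overlapping windows of seq_length tokens, and perplexity is the exponential of the token-averaged negative log-likelihood. The sketch below is simplified and not the repos' exact code; the model id and seq_len are just the values I used.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # model I evaluated
seq_len = 1024

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the whole test split and tokenize it once.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
input_ids = enc.input_ids

# Non-overlapping windows of seq_len tokens; remainder is dropped.
n_chunks = input_ids.size(1) // seq_len
nlls = []
with torch.no_grad():
    for i in range(n_chunks):
        chunk = input_ids[:, i * seq_len : (i + 1) * seq_len].to(model.device)
        # The model shifts labels internally; .loss is the mean NLL per token.
        loss = model(chunk, labels=chunk).loss
        # Scale back to a per-window total NLL (the repos use this seq_len scaling).
        nlls.append(loss.float() * seq_len)

ppl = torch.exp(torch.stack(nlls).sum() / (n_chunks * seq_len))
print(f"perplexity: {ppl.item():.2f}")
```

This is what produces the ~5.43 numbers for me, whereas the run_clm.py-based evaluation in source 1 reports exp(eval_loss) from its own tokenization and block-grouping pipeline.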