I am trying to evaluate the perplexity of a model on WikiText-2.
The three code sources I am using are:
1. yxli2123/LoftQ
2. horseee/LLM-Pruner ([NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models)
3. locuslab/wanda (a simple and effective LLM pruning approach)
Sources 2 and 3 agree, but source 1, which is based on the run_clm.py script from huggingface/transformers (examples/pytorch/language-modeling/run_clm.py), gives a significantly different result.
For example, when evaluating Llama-2 13B with seq_length 1024, I get the following respective perplexities:
1. 12.02
2. 5.43
3. 5.43
Does anyone know why the value obtained from source 1 differs so much from the other two?
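
For context, here is a minimal sketch of the evaluation loop that I believe sources 2 and 3 implement: the whole WikiText-2 test split is concatenated into one token stream and scored in fixed-length non-overlapping windows. The checkpoint name and the exact window handling are my assumptions, not copied from either repo.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # assumed checkpoint name
seq_len = 1024

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the entire WikiText-2 test split into a single token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
input_ids = enc.input_ids

# Score non-overlapping windows of seq_len tokens each.
n_windows = input_ids.size(1) // seq_len
nlls = []
with torch.no_grad():
    for i in range(n_windows):
        batch = input_ids[:, i * seq_len : (i + 1) * seq_len].to(model.device)
        # labels=batch makes the model return the mean cross-entropy
        # over the (shifted) tokens in this window.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seq_len)

ppl = torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len))
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```

My understanding is that run_clm.py instead tokenizes examples individually and groups them into blocks before averaging losses, so the two pipelines do not see the same token stream, which may be where the gap comes from, but I have not verified this.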