Hello,
I am having a hard time convincing myself that the following is expected behavior of GPT2LMHeadModel in these scenarios:
1. Fine-tuning for the LM task with new data: training and evaluation for 5 epochs, with
   model = AutoModelForCausalLM.from_pretrained('gpt2')
   I get eval data perplexity on the order of ~40s (see the sketch after this list for what I mean by eval perplexity).
2. Using the fine-tuned GPT2LMHead from 1. to reproduce the evaluation results from 1. via
   model = AutoModelForCausalLM.from_pretrained('<path_to_finetuned_model_from_1>')
   I get eval data perplexity on the order of ~300s.
3. Using the fine-tuned GPT2LMHead from 1. to reproduce the evaluation results from 1., but instead of using AutoModelForCausalLM, I create a custom class exactly the same as GPT2LMHeadModel and load the model as
   model = GPT2LMHeadModel.from_pretrained('<path_to_finetuned_model_from_1>')
   I get eval data perplexity on the order of ~40s (basically the same as 1.).
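For reference, this is roughly what I mean by eval perplexity. A simplified sketch (the texts below are placeholders; my actual eval set and loop are different):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.eval()

# Placeholder eval data; in my runs this is my own held-out set.
eval_texts = ["first held-out sentence.", "second held-out sentence."]

losses = []
with torch.no_grad():
    for text in eval_texts:
        enc = tokenizer(text, return_tensors='pt')
        # Passing labels=input_ids makes the model return the LM cross-entropy loss.
        out = model(**enc, labels=enc['input_ids'])
        losses.append(out.loss)

# Perplexity = exp(mean cross-entropy over the eval data)
perplexity = torch.exp(torch.stack(losses).mean())
print(perplexity.item())
```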
So, now I wanted to get to the bottom of it:
Number of parameters in the model loaded as in 2. = 124439808
Number of parameters in the model loaded as in 3. = 163037184
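(These counts are just a sum over the registered parameters of each loaded model, along the lines of this sketch:)

```python
from transformers import AutoModelForCausalLM

# Loading as in 2.; the model from 3. is loaded with my custom class copy instead.
model = AutoModelForCausalLM.from_pretrained('<path_to_finetuned_model_from_1>')

# Total number of elements across all registered parameters
print(sum(p.numel() for p in model.parameters()))  # 124439808 for the model loaded as in 2.
```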
This led me to "Why is the lm_head layer in GPT2LMHeadModel not a parameter?", "Why is the lm_head layer in GPT2LMHeadModel not a parameter? · Issue #6291 · huggingface/transformers · GitHub", and "Clarification about GPT2LMHeadModel lm_head weights · Issue #3799 · huggingface/transformers · GitHub".
These posts made it clear that the model loaded as in 3. had lm_head as additional parameters, and the math added up, confirmed by [n for n, p in model.named_parameters()].
Now, this led me to debug the code to verify whether tie_weights (and _tie_or_clone_weights) in modeling_utils.py was actually doing what @patrickvonplaten and @sgugger said in the above posts.
And I do find that the output_embeddings weights were indeed set to be the same as the input_embeddings weights.
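A quick way to check the tying on a loaded model (sketch) is to compare the input and output embedding tensors directly:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')

in_emb = model.get_input_embeddings().weight    # transformer.wte.weight
out_emb = model.get_output_embeddings().weight  # lm_head.weight

# With properly tied weights these refer to the same underlying tensor.
print(in_emb is out_emb)
print(in_emb.data_ptr() == out_emb.data_ptr())
```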
However, something here is most likely still causing such large differences in my perplexity numbers.
All of the above made me curious about the pre-trained GPT2 model, so I repeat 2. and 3. with 'gpt2'. But this time, 3. gives me crazy huge numbers (I guess 8-10 digits long), and 2. gives me ~130s.
Finally, I am left wondering about a reasonable explanation. All I can imagine is that the fine-tuning in 1. also fine-tuned the lm_head parameters, which ended up different from the embedding weights and thus gave a decent increase in performance (lower perplexity). But I'd definitely appreciate a deeper explanation. (Tagging @sgugger and @patrickvonplaten for help.)
Thanks!
Nikita