Perplexity from fine-tuned GPT2LMHeadModel with and without lm_head as a parameter

Hello,

I am having a hard time convincing myself that the following could be expected behavior of GPT2LMHeadModel in these scenarios:

  1. Fine-tuning for the LM task on new data: training and evaluation for 5 epochs, with
    model = AutoModelForCausalLM.from_pretrained('gpt2')
    I get eval perplexity in the ~40s.

  2. Using the fine-tuned GPT2LMHead model from 1. to reproduce the evaluation results from 1. via:
    model = AutoModelForCausalLM.from_pretrained('<path_to_finetuned_model_from_1>')
    I get eval perplexity in the ~300s.

  3. Using the fine-tuned GPT2LMHead model from 1. to reproduce the evaluation results from 1., but instead of
    using AutoModelForCausalLM, I create a custom class exactly the same as GPT2LMHeadModel
    and load the model as:
    model = GPT2LMHeadModel.from_pretrained('<path_to_finetuned_model_from_1>')
    I get eval perplexity in the ~40s (basically, the same as 1.). (A sketch of how I load and evaluate in each case follows this list.)
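For concreteness, here is a minimal sketch of how I load the checkpoints and compute the eval perplexity in each case. The checkpoint path and eval texts are placeholders, and GPT2LMHeadModel below stands in for my copied class from 3.; the real run uses my full eval set:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2LMHeadModel

FINETUNED_PATH = "<path_to_finetuned_model_from_1>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Scenario 2: load through the Auto class.
model_auto = AutoModelForCausalLM.from_pretrained(FINETUNED_PATH)

# Scenario 3: load through (a copy of) GPT2LMHeadModel.
model_head = GPT2LMHeadModel.from_pretrained(FINETUNED_PATH)

def perplexity(model, texts):
    """exp of the mean token-level cross-entropy over the eval texts."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            loss = model(**enc, labels=enc["input_ids"]).loss
            losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))

eval_texts = ["<held-out evaluation text>"]  # placeholder
print("scenario 2:", perplexity(model_auto, eval_texts))
print("scenario 3:", perplexity(model_head, eval_texts))
```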

So, now I wanted to get to the bottom of it:

Number of parameters in the model loaded as in 2. = 124439808
Number of parameters in the model loaded as in 3. = 163037184
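This is roughly how I counted them. For the stock GPT2LMHeadModel, the tied lm_head.weight and transformer.wte.weight are one and the same Parameter, so .parameters() counts that matrix only once; in my copied class it shows up as an extra 50257 × 768 = 38,597,376 parameters, which is exactly the gap above (again, GPT2LMHeadModel here stands in for my copied class):

```python
from transformers import AutoModelForCausalLM, GPT2LMHeadModel

FINETUNED_PATH = "<path_to_finetuned_model_from_1>"  # placeholder

model_2 = AutoModelForCausalLM.from_pretrained(FINETUNED_PATH)  # as in 2.
model_3 = GPT2LMHeadModel.from_pretrained(FINETUNED_PATH)       # as in 3. (my copied class)

n_2 = sum(p.numel() for p in model_2.parameters())
n_3 = sum(p.numel() for p in model_3.parameters())
print(n_2, n_3, n_3 - n_2)

# For gpt2: vocab_size * n_embd = 50257 * 768 = 38,597,376, the size of lm_head.weight.
print(model_2.config.vocab_size * model_2.config.n_embd)
```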

This led me to Why is the lm_head layer in GPT2LMHeadModel not a parameter?, Why is the lm_head layer in GPT2LMHeadModel not a parameter? · Issue #6291 · huggingface/transformers · GitHub, and Clarification about GPT2LMHeadModel lm_head weights · Issue #3799 · huggingface/transformers · GitHub.

These posts made it clear that the model loaded as in 3. has lm_head as an additional parameter, and the math adds up: the 38,597,376 extra parameters are exactly one 50257 × 768 lm_head matrix, as confirmed by [n for n, p in model.named_parameters()].

Now, this led me to debug the code to verify whether tie_weights (and _tie_or_clone_weights) in modeling_utils.py was actually doing what @patrickvonplaten and @sgugger said in the above posts.
And I did find that the output_embeddings weights were set to the same tensor as the input_embeddings weights.
However, this is most likely what is causing such large differences in my perplexity numbers.
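Here is the quick check I ran (for the stock class, at least) to convince myself that the two weights really do end up as the same tensor after tie_weights():

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

wte = model.get_input_embeddings().weight       # transformer.wte.weight
lm_head = model.get_output_embeddings().weight  # lm_head.weight

print(lm_head is wte)                        # True: one shared Parameter
print(lm_head.data_ptr() == wte.data_ptr())  # True: same underlying storage

# Because it is the same Parameter, lm_head.weight is not listed separately:
print([n for n, _ in model.named_parameters() if "lm_head" in n or "wte" in n])
```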

All of the above made me curious about the pre-trained GPT2 model, so I repeated 2. and 3. with 'gpt2'. But this time, 3. gives me absurdly large numbers (I guess 8-10 digits long), and 2. gives me ~130s.

Finally, I am left looking for a reasonable explanation. All I can imagine is that the fine-tuning in 1. also fine-tuned the lm_head parameters, which ended up different from the embedding weights and thus gave a decent gain in performance (lower perplexity). But I'd definitely appreciate a deeper explanation. (Tagging @sgugger and @patrickvonplaten for help.)
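One way I could test this hypothesis (a sketch, assuming the fine-tuned checkpoint was saved as a pytorch_model.bin; newer versions save model.safetensors instead) is to look at the saved state dict directly, bypassing any tying that happens at load time:

```python
import torch

# Placeholder path to the checkpoint saved after fine-tuning in 1.
state = torch.load("<path_to_finetuned_model_from_1>/pytorch_model.bin", map_location="cpu")

print([k for k in state if "wte" in k or "lm_head" in k])

# If a separate lm_head.weight was saved, compare it to the input embeddings:
if "lm_head.weight" in state:
    same = torch.equal(state["lm_head.weight"], state["transformer.wte.weight"])
    print("lm_head identical to wte:", same)
```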

Thanks!

Nikita

I did verify the model parameters with GPT2ForSequenceClassification, because the perplexity irreproducibility above made me start to doubt my downstream task results (which use the fine-tuned model from 1.).
But I did not find any surprises there, which convinces me that my results on those tasks are credible.

PS: I added this as a comment just to separate the main issue from a side detail.

Hey Nikita,

Sorry, what is the question here exactly?

Hey Patrick,

I am confused about the differences in the perplexity numbers across the scenarios above.
I have tried to write my questions down below:

  1. Is the lm_head supposed to be a separate parameter when fine-tuning, given that it is tied to the input embedding weights?
    1.1 If no, then is something missing in my scenarios 2 and 3?
    1.2 If yes, then is it only for the pre-trained GPT2LMHead model that the lm_head weights are tied to the input embedding weights? (A small sketch of what I mean follows below.)
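In case it helps, here is a small sketch of what I mean. My understanding (to be confirmed) is that the tying is controlled by config.tie_word_embeddings, which defaults to True:

```python
from transformers import AutoModelForCausalLM

# Default: lm_head.weight is tied to transformer.wte.weight.
tied = AutoModelForCausalLM.from_pretrained("gpt2")
print(tied.lm_head.weight.data_ptr() == tied.transformer.wte.weight.data_ptr())  # True

# Untied: the head stays a separate parameter. If the checkpoint stores no
# separate lm_head.weight, that head would be freshly initialized (which could
# explain the huge perplexity I see in 3. with the plain 'gpt2' checkpoint).
untied = AutoModelForCausalLM.from_pretrained("gpt2", tie_word_embeddings=False)
print(untied.lm_head.weight.data_ptr() == untied.transformer.wte.weight.data_ptr())  # False
```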

Thanks.

Hi @patrickvonplaten, I was wondering if my question/confusion is clearer from my last comment, and if you'd be able to help me understand the differences better. Thanks! :slight_smile: