Hello,
I am having a hard time convincing myself that the following is expected behavior of GPT2LMHeadModel in these scenarios:
1. Fine-tuning for the LM task with new data: training and evaluation for 5 epochs, with
   model = AutoModelForCausalLM.from_pretrained('gpt2')
   I get eval data perplexity on the order of ~40s (see the sketch after this list for what I mean by eval perplexity).
2. Using the fine-tuned GPT2LMHead from 1. to reproduce the evaluation results from 1. via
   model = AutoModelForCausalLM.from_pretrained('<path_to_finetuned_model_from_1>')
   I get eval data perplexity on the order of ~300s.
3. Using the fine-tuned GPT2LMHead from 1. to reproduce the evaluation results from 1., but instead of using AutoModelForCausalLM, I create a custom class exactly the same as GPT2LMHeadModel and load the model as
   model = GPT2LMHeadModel.from_pretrained('<path_to_finetuned_model_from_1>')
   I get eval data perplexity on the order of ~40s (basically the same as 1.).
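For reference, this is roughly what I mean by eval perplexity. A simplified sketch (the texts below are placeholders; my actual eval set and loop are different):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.eval()

# Placeholder eval data; in my runs this is my own held-out set.
eval_texts = ["first held-out sentence.", "second held-out sentence."]

losses = []
with torch.no_grad():
    for text in eval_texts:
        enc = tokenizer(text, return_tensors='pt')
        # Passing labels=input_ids makes the model return the LM cross-entropy loss.
        out = model(**enc, labels=enc['input_ids'])
        losses.append(out.loss)

# Perplexity = exp(mean cross-entropy over the eval data)
perplexity = torch.exp(torch.stack(losses).mean())
print(perplexity.item())
```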
So, now I wanted to get to the bottom of it:
Number of parameters in the model loaded as in 2. = 124439808
Number of parameters in the model loaded as in 3. = 163037184
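(These counts are just a sum over the registered parameters of each loaded model, along the lines of this sketch:)

```python
from transformers import AutoModelForCausalLM

# Loading as in 2.; the model from 3. is loaded with my custom class copy instead.
model = AutoModelForCausalLM.from_pretrained('<path_to_finetuned_model_from_1>')

# Total number of elements across all registered parameters
print(sum(p.numel() for p in model.parameters()))  # 124439808 for the model loaded as in 2.
```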
This led me to "Why is the lm_head layer in GPT2LMHeadModel not a parameter?", "Why is the lm_head layer in GPT2LMHeadModel not a parameter? · Issue #6291 · huggingface/transformers · GitHub", and "Clarification about GPT2LMHeadModel lm_head weights · Issue #3799 · huggingface/transformers · GitHub".
These posts made it clear that the model loaded as in 3. had lm_head as additional parameters, and the math added up, confirmed by [n for n, p in model.named_parameters()].
Now, this led me to debug the code to verify whether tie_weights (and _tie_or_clone_weights) in modeling_utils.py was actually doing what @patrickvonplaten and @sgugger said in the above posts.
And I do find that the output_embeddings weights were indeed set to be the same as the input_embeddings weights.
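A quick way to check the tying on a loaded model (sketch) is to compare the input and output embedding tensors directly:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')

in_emb = model.get_input_embeddings().weight    # transformer.wte.weight
out_emb = model.get_output_embeddings().weight  # lm_head.weight

# With properly tied weights these refer to the same underlying tensor.
print(in_emb is out_emb)
print(in_emb.data_ptr() == out_emb.data_ptr())
```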
However, something here is most likely still causing such large differences in my perplexity numbers.
All of the above made me curious about the pre-trained GPT2 model, so I repeat 2. and 3. with 'gpt2'. But this time, 3. gives me crazy huge numbers (I guess 8-10 digits long), and 2. gives me ~130s.
Finally, I am left wondering about a reasonable explanation. All I can imagine is that the fine-tuning in 1. also fine-tuned the lm_head parameters, which ended up different from the embedding weights and thus gave a decent increase in performance (lower perplexity). But I'd definitely appreciate a deeper explanation. (Tagging @sgugger and @patrickvonplaten for help.)
Thanks!
Nikita