Is there any reason why GPT-Neo would behave fundamentally differently from GPT2?

Hey guys. I'm running some experiments as part of a research project. The code was originally written for GPT-Neo 1.3B, but one baseline we want to compare against only supports GPT2-XL, so I added support for it to our code (i.e., I just added a clause along the lines of `if model_name == "gpt2": model = GPT2LMHeadModel.from_pretrained("gpt2-xl")`). Both models are of course loaded from Hugging Face.
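For reference, the model-selection logic looks roughly like this. This is a simplified sketch, not my exact code, and the function name and variable names are just illustrative:

```python
from transformers import (
    AutoTokenizer,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    GPTNeoForCausalLM,
)

def load_model(model_name: str):
    """Return (model, tokenizer) for the requested backbone."""
    if model_name == "gpt2":
        # Newly added branch for the GPT2-XL baseline
        model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
        tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
    else:
        # Original setup: GPT-Neo 1.3B
        model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
        tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    model.eval()
    return model, tokenizer
```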

The issue is that GPT2-XL produces results that are clearly wrong. It's hard to explain without walking through my code in depth, but basically I'm running a multiple-choice evaluation where the model is rewarded for assigning the highest probability to the correct label. GPT2-XL assigns exactly the same probability to all but one of the labels, and a very small probability to that remaining one (which usually isn't even the right answer anyway). On top of that, fine-tuning on the dataset has literally no effect on the results, which is bizarre.
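To make the setup concrete, the scoring step is roughly equivalent to the sketch below (again simplified and illustrative, not my actual code): each candidate answer is scored by the total log-probability the model assigns to its tokens when appended to the prompt, and the highest-scoring candidate is taken as the model's choice.

```python
import torch
import torch.nn.functional as F

def score_choice(model, tokenizer, prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens after the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    choice_ids = tokenizer(choice, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, choice_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # Logits at position t predict token t+1, so drop the last position and
    # align the remaining predictions with the choice tokens.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    choice_log_probs = (
        log_probs[0, prompt_ids.shape[1] - 1 :, :]
        .gather(-1, choice_ids[0].unsqueeze(-1))
        .squeeze(-1)
    )
    return choice_log_probs.sum().item()

# Usage: pick the candidate with the highest score.
# best = max(choices, key=lambda c: score_choice(model, tokenizer, prompt, c))
```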

So my question is: is there any fundamental difference in how these two models are set up in Hugging Face that could cause errors like this? In other words, is there anything I need to change in the code to accommodate GPT2, beyond the initial `model = GPT2LMHeadModel.from_pretrained(...)` line? I'm not very familiar with Hugging Face models myself, so I'm not entirely sure. But the fact that the code runs yet produces such bad results is strange; I would expect that if something were genuinely wrong, I'd hit a tensor-size mismatch error somewhere.