I wonder how the GPT-2 pretrained models were created. The original models were checkpointed with the TensorFlow 1 API and use a substantially different computation graph than the reimplementation in Hugging Face Transformers, so I wonder what you did to get from one to the other.
Have you found a way to adapt the originally published weights?
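For context, here is a rough sketch of the kind of conversion I would imagine: read the TF1 checkpoint variables and remap them onto the PyTorch module. I am only guessing that a helper like `load_tf_weights_in_gpt2` exists in the GPT-2 module, and the checkpoint path is a placeholder, so please correct me if the real process is different:

```python
# Rough sketch only -- load_tf_weights_in_gpt2 and the checkpoint path are my
# assumptions about how the conversion might work, not a confirmed recipe.
import tensorflow as tf

from transformers import GPT2Config, GPT2Model
from transformers.models.gpt2.modeling_gpt2 import load_tf_weights_in_gpt2

ckpt_path = "gpt2-small/model.ckpt"  # placeholder: prefix of the OpenAI TF1 checkpoint

# Inspect the original TF1 variables (model/wte, model/h0/attn/c_attn/w, ...)
reader = tf.train.load_checkpoint(ckpt_path)
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)

# Build the PyTorch reimplementation (the defaults match the 124M model) and,
# I assume, remap the TF variables onto its parameters
config = GPT2Config()
model = GPT2Model(config)
load_tf_weights_in_gpt2(model, config, ckpt_path)

# Save in the usual transformers format
model.save_pretrained("gpt2-small-converted")
```

Is this roughly what was done, or was something else needed to bridge the graph differences?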
Have the openai developers shared WebText with you?
Have you trained the models on similar data?
Thanks for your help!