Fine-tuning gpt2 generates repetitive text despite many hyperparameter settings; any luck with gpt2-large/xl?

Hello,

It seems that using the run_clm.py training script overfits my dataset. After I train a model for 1 epoch at learning rates between 1e-6 and 1e-4, it only produces the same sentence repeated until the end of the block. A sample run looks like this:

```
python run_clm.py \
  --model_name_or_path gpt2-medium \
  --train_file train.txt \
  --do_train \
  --output_dir output-gpt2 \
  --per_device_train_batch_size=1 \
  --save_steps 1000 \
  --num_train_epochs=1
```

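For reference, this is roughly how I sample from the fine-tuned checkpoint to see the repetition (the prompt and generation settings below are placeholders rather than my exact script):

```python
# Rough sketch of how I inspect the fine-tuned model's output.
# The prompt and decoding settings are placeholders, not my exact script.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("output-gpt2")
model = GPT2LMHeadModel.from_pretrained("output-gpt2")

inputs = tokenizer("So, what do you want to do tonight?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,  # sampling instead of pure greedy decoding
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```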
My training file is processed from the Cornell Movie-Dialogs Corpus and is formatted line by line (one utterance per line).
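In case the preprocessing matters, this is roughly how I build train.txt (the field separator and file encoding of movie_lines.txt are my assumptions from the corpus layout):

```python
# Rough sketch of my preprocessing: one utterance per line in train.txt.
# Assumes movie_lines.txt uses the " +++$+++ " field separator with the
# utterance text in the last field, and is ISO-8859-1 encoded (my assumption).
with open("movie_lines.txt", encoding="iso-8859-1") as src, \
     open("train.txt", "w", encoding="utf-8") as dst:
    for line in src:
        fields = line.rstrip("\n").split(" +++$+++ ")
        text = fields[-1].strip()
        if text:
            dst.write(text + "\n")
```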

Any advice would be awesome. I’ll be using the model downstream, so I’m looking to get a nice low loss and move on.

Additionally, has anyone been able to train gpt2-large or, ideally, gpt2-xl? Can gpt2-large be distributed across a Colab TPU with batch_size=1?
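For context, this is roughly what I was planning to try for gpt2-large on a single GPU before looking at TPUs; the memory-saving flag values are guesses and I haven’t confirmed this fits at batch size 1:

```
# Rough sketch only; flag values are guesses, memory footprint unverified.
python run_clm.py \
  --model_name_or_path gpt2-large \
  --train_file train.txt \
  --do_train \
  --output_dir output-gpt2-large \
  --per_device_train_batch_size=1 \
  --gradient_accumulation_steps 8 \
  --fp16 \
  --block_size 512 \
  --num_train_epochs=1
```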

Thank you!