Training a Bert2GPT2 model for summarization doesn't lead to acceptable results

I tried to warm-start encoder-decoder models from different kinds of pretrained checkpoints (e.g., an XLNet or RoBERTa encoder with a GPT-2 decoder) for the summarization task, based on @patrickvonplaten 's great blog post on using Bert2GPT2 for summarization. All of my experiments ended with very poor ROUGE results (close to zero on every ROUGE score). I then ran exactly the same code as in patrickvonplaten/bert2gpt2-cnn_dailymail-fp16 · Hugging Face on Colab Pro+, with only slight modifications to the training arguments and a batch size of 4 instead of 16, and again got poor results (ROUGE-2 = 0.004), whereas I get ROUGE-2 = 15.16 when I load the published checkpoint with:

model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2gpt2-cnn_dailymail-fp16")
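
For reference, this is roughly how I measure ROUGE with that checkpoint (a minimal sketch following the blog; I assume the rouge metric from datasets and a small slice of the CNN/DailyMail 3.0.0 test split):

import datasets
from transformers import BertTokenizer, GPT2Tokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2gpt2-cnn_dailymail-fp16").to("cuda")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # encoder tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")                # decoder tokenizer

test_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")
rouge = datasets.load_metric("rouge")

def generate_summary(batch):
    # Encode articles with the BERT tokenizer, generate with the decoder, decode with GPT-2
    inputs = bert_tokenizer(batch["article"], padding="max_length",
                            truncation=True, max_length=512, return_tensors="pt")
    outputs = model.generate(inputs.input_ids.to("cuda"),
                             attention_mask=inputs.attention_mask.to("cuda"))
    batch["pred_summary"] = gpt2_tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return batch

results = test_data.map(generate_summary, batched=True, batch_size=8)
rouge2 = rouge.compute(predictions=results["pred_summary"],
                       references=results["highlights"],
                       rouge_types=["rouge2"])["rouge2"].mid
print(rouge2.fmeasure)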

Could you please help me figure out why I get such different and poor results after training the model for almost 14 hours, even though I'm using the same code as @patrickvonplaten ?

Thank you in advance :slight_smile:

The training arguments I used:

from transformers import TrainingArguments

batch_size = 4

training_args = TrainingArguments(
    "bertgpt2_cnn",                       # output_dir
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    # evaluation_strategy="epoch",
    save_strategy="epoch",
    # predict_from_generate=True,
    # evaluate_during_training=True,
    do_train=True,
    do_eval=True,
    # logging_steps=1000,
    # save_steps=1000,
    # eval_steps=1000,
    overwrite_output_dir=True,
    warmup_steps=2000,
    save_total_limit=10,
    fp16=True,
)
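
The trainer itself is set up roughly like this (a sketch only; train_data and val_data are placeholders for my tokenized CNN/DailyMail splits, and the data collator / compute_metrics wiring is omitted):

from transformers import Trainer

trainer = Trainer(
    model=model,                 # the warm-started EncoderDecoderModel
    args=training_args,
    train_dataset=train_data,    # tokenized train split
    eval_dataset=val_data,       # tokenized validation split
)
trainer.train()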