Replicating RoBERTa-base GLUE results

Hi,

I’m trying to replicate the RoBERTa-base GLUE fine-tuning results using the run_glue.py script. With the pretrained roberta-base checkpoint from the model hub, I’m finding it difficult to match the results reported in the model card.

For example, after several runs on RTE, the best accuracy I’ve achieved is around 77.98 (compared to the 78.7 reported in the model card), using these params:

--model_name_or_path=roberta-base \
--task_name=rte \
--seed=42 \
--max_seq_length=512 \
--num_train_epochs=10 \
--per_device_train_batch_size=16 \
--learning_rate=2e-5 \
--weight_decay=0.1 \
--warmup_ratio=0.06 \
--adam_beta1=0.9 \
--adam_beta2=0.98 \
--adam_epsilon=1e-6 \
--evaluation_strategy=steps \
--eval_steps=50 \
--load_best_model_at_end=true \
--metric_for_best_model=eval_accuracy \
--save_steps=50

Most of these params closely follow the original fairseq fine-tuning configs, but perhaps the Hugging Face script expects something different.
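
For completeness, here’s the full invocation I’ve been using (a sketch: the script path and output-dir naming are mine; the hyperparameters are exactly the config above). Since RTE is small and quite seed-sensitive, and IIRC the RoBERTa paper reports medians over five runs on dev, I’ve been repeating the run over a few seeds and taking the best dev accuracy:

```bash
# Sketch: sweep a few seeds with run_glue.py from the transformers examples
# (examples/pytorch/text-classification in recent versions). Output-dir naming
# is arbitrary; all hyperparameters match the config above.
for SEED in 42 1 2 3 4; do
  python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name rte \
    --do_train --do_eval \
    --seed "$SEED" \
    --max_seq_length 512 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 16 \
    --learning_rate 2e-5 \
    --weight_decay 0.1 \
    --warmup_ratio 0.06 \
    --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
    --evaluation_strategy steps --eval_steps 50 --save_steps 50 \
    --load_best_model_at_end true \
    --metric_for_best_model eval_accuracy \
    --output_dir "rte_seed_${SEED}"
done
```

(I keep save_steps equal to eval_steps so that load_best_model_at_end can restore the best checkpoint.)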

I see a similar discrepancy for CoLA, STS-B, and MRPC, which brings my GLUE average down to 85.4 compared to the reported 86.3.

Could someone kindly share the configuration(s) used to achieve the numbers reported in the model card, for RTE and the other GLUE tasks? Much appreciated!