Replicating RoBERTa-base GLUE results

Hi,

I’m trying to replicate the RoBERTa-base GLUE fine-tuning results using the run_glue.py script. With the pretrained roberta-base checkpoint from the model hub, I’m finding it difficult to match the results reported in the model card.

For example, after several runs on RTE, the best accuracy I’ve achieved is around 77.98 (compared to the 78.7 reported in the model card), using these params:

--model_name_or_path=roberta-base \
--task_name=rte \
--seed=42 \
--max_seq_length=512 \
--num_train_epochs=10 \
--per_device_train_batch_size=16 \
--learning_rate=2e-5 \
--weight_decay=0.1 \
--warmup_ratio=0.06 \
--adam_beta1=0.9 \
--adam_beta2=0.98 \
--adam_epsilon=1e-6 \
--evaluation_strategy=steps \
--eval_steps=50 \
--load_best_model_at_end=true \
--metric_for_best_model=eval_accuracy \
--save_steps=50

Most of these params closely follow the original fairseq fine-tuning configs, but perhaps the Hugging Face script expects something different.
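
For completeness, here’s the full invocation I’ve been using (a sketch: the script path and output-dir naming are mine; the hyperparameters are exactly the config above). Since RTE is small and quite seed-sensitive, and IIRC the RoBERTa paper reports medians over five runs on dev, I’ve been repeating the run over a few seeds and taking the best dev accuracy:

```bash
# Sketch: sweep a few seeds with run_glue.py from the transformers examples
# (examples/pytorch/text-classification in recent versions). Output-dir naming
# is arbitrary; all hyperparameters match the config above.
for SEED in 42 1 2 3 4; do
  python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name rte \
    --do_train --do_eval \
    --seed "$SEED" \
    --max_seq_length 512 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 16 \
    --learning_rate 2e-5 \
    --weight_decay 0.1 \
    --warmup_ratio 0.06 \
    --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
    --evaluation_strategy steps --eval_steps 50 --save_steps 50 \
    --load_best_model_at_end true \
    --metric_for_best_model eval_accuracy \
    --output_dir "rte_seed_${SEED}"
done
```

(I keep save_steps equal to eval_steps so that load_best_model_at_end can restore the best checkpoint.)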

I see a similar discrepancy for CoLA, STS-B, and MRPC, which brings my GLUE average down to 85.4 compared to the reported 86.3.

Could someone kindly share the configuration(s) used to achieve the numbers reported in the model card, for RTE and the other GLUE tasks? Much appreciated!