Reproduce BERT and RoBERTa

Hi all,

I tried to use the run_mlm.py script to reproduce the results of bert-base-uncased. However, my reproduced results are always lower than the ones reported on this website by the Hugging Face team.

| Task | Metric | Reported by Hugging Face | Our reproduced result |
|---|---|---|---|
| CoLA | Matthews corr. | 56.53 | 47.92 |
| SST-2 | Accuracy | 92.32 | 87.56 |
| MRPC | F1/Accuracy | 88.85/84.07 | 82.03/80.97 |
| STS-B | Pearson/Spearman corr. | 88.64/88.48 | 82.45/82.76 |
| QQP | Accuracy/F1 | 90.71/87.49 | 88.23/86.12 |
| MNLI | Matched acc./Mismatched acc. | 83.91/84.10 | 82.34/83.01 |
| QNLI | Accuracy | 90.66 | 85.45 |
| RTE | Accuracy | 65.70 | 56.95 |

I think there must be some problem with my experiment. My setup was as follows:

(1) I used the code in this file without any changes.

(2) I loaded the bookcorpus and wiki datasets directly from the datasets library; the text is chunked into sequences of 512 tokens.

(3) I tried two configurations: batch size 256 for 1M steps, and batch size 8K for 50K steps. Both gave results worse than the reported numbers.
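For anyone comparing setups, here is a minimal sketch of what steps (2) and (3) amount to. It is not the exact code from the script: the chunking helper only mirrors the general idea of concatenating tokenized documents and cutting them into fixed-size blocks, the token ids are made-up toy values, and "8K" is assumed to mean 8192.

```python
from itertools import chain


def group_texts(examples, block_size=512):
    """Concatenate tokenized documents and split them into fixed-size blocks."""
    # Flatten the per-document token lists into one long list per field.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the trailing remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    return {
        k: [seq[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, seq in concatenated.items()
    }


def sequences_seen(batch_size, steps):
    """Total number of training sequences processed over a run."""
    return batch_size * steps


# Toy input: three "documents" of made-up token ids (not real tokenizer output).
toy = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
blocks = group_texts(toy, block_size=4)

# The two configurations from step (3); 8192 is an assumed reading of "8K".
run_a = sequences_seen(256, 1_000_000)
run_b = sequences_seen(8192, 50_000)
print(blocks["input_ids"], run_a, run_b)
```

Under these assumptions the two runs do not see the same amount of data (the second processes more sequences in far fewer updates), so they are not directly interchangeable when comparing against the reported numbers.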

I would really appreciate it if you could provide a script that I can use to reproduce BERT or RoBERTa. Thank you very much!


It’s fairly common to get lower scores than the official team, probably due to randomness. But I’m still interested in why this is the case here. Do you have any updates on this? Looking forward to it! @wyu1