Hi all,
I tried to use the run_mlm.py script
to reproduce the results of the bert-base-uncased model. However, my reproduced results are consistently lower than the ones reported on this website by the Hugging Face team.
| Task | Metric | Reported by Huggingface | Our reproduced result |
|---|---|---|---|
| CoLA | Matthew’s corr | 56.53 | 47.92 |
| SST-2 | Accuracy | 92.32 | 87.56 |
| MRPC | F1/Accuracy | 88.85/84.07 | 82.03/80.97 |
| STS-B | Pearson/Spearman corr. | 88.64/88.48 | 82.45/82.76 |
| QQP | Accuracy/F1 | 90.71/87.49 | 88.23/86.12 |
| MNLI | Matched acc./Mismatched acc. | 83.91/84.10 | 82.34/83.01 |
| QNLI | Accuracy | 90.66 | 85.45 |
| RTE | Accuracy | 65.70 | 56.95 |
I think there must be some problem with my setup. Here is how I ran my experiment:
(1) I used the code in this file without any changes.
(2) I loaded the BookCorpus and Wikipedia datasets directly from the datasets
library; the text is chunked into blocks of 512 tokens.
(3) I set the batch size to 256 and ran 1M steps; I also tried a batch size of 8K for 50K steps. Both results are worse than the reported numbers.
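For reference, this is roughly how I chunk the tokenized text into 512-token blocks — a minimal sketch of the grouping logic (modeled on the group_texts helper in run_mlm.py; the fake token ids at the bottom are just for illustration):

```python
from itertools import chain

max_seq_length = 512  # block size used in my runs

def group_texts(examples):
    """Concatenate tokenized examples, then split into fixed-size blocks.

    `examples` is a dict of lists, e.g. {"input_ids": [[...], [...], ...]},
    as produced by a batched tokenizer call via datasets.Dataset.map.
    """
    # Concatenate all sequences for each key (input_ids, attention_mask, ...).
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a complete block.
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

# Tiny usage example with fake token ids:
fake = {"input_ids": [list(range(700)), list(range(400))]}
blocks = group_texts(fake)
# 1100 tokens -> two full 512-token blocks; the 76-token remainder is dropped.
```

If this differs from what the reported numbers were produced with (e.g. a different max_seq_length or no remainder dropping), that could be part of the gap.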
I would really appreciate it if you could provide a script that I can use to reproduce BERT or RoBERTa. Thank you very much!