I’m trying to reproduce the GLUE results from the paper “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. I currently get about 80 accuracy on the MNLI-m dev set, while the paper reports 82.
Here are the hyperparameters I’m using:
epoch=3
lr=2e-5
batch size=32*4 cards
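
To make the setup above concrete, here is a rough sketch of what these settings work out to. It assumes the 32 is a per-card batch size, that there is no gradient accumulation, and that the MNLI training set has 392,702 examples; none of these are confirmed by the settings as listed, and the variable names are only illustrative.

```python
import math

# Values taken from the settings listed above
num_epochs = 3
learning_rate = 2e-5
per_card_batch_size = 32   # assumption: 32 is per card, not the global batch
num_cards = 4

# Assumption: no gradient accumulation is used
grad_accum_steps = 1

# Assumption: number of examples in the MNLI training split
mnli_train_examples = 392_702

# Number of examples contributing to each optimizer update
effective_batch_size = per_card_batch_size * num_cards * grad_accum_steps

# Optimizer updates per epoch and over the whole run
steps_per_epoch = math.ceil(mnli_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"total steps:          {total_steps}")
```

Under these assumptions the effective batch size is 128, not 32, so the run takes four times fewer optimizer steps than a single-card run with the same learning rate and epoch count. That difference may be worth keeping in mind when comparing against the paper's number.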
Can anybody share the hyperparameters used for the GLUE experiments in this paper?