I’ve been trying to retrain DistilRoBERTa from the information given here along with the example code/documentation here.
I’m a bit unclear on the exact configuration used to train the DistilRoBERTa model. I have been assuming it uses the same configuration as DistilBERT with minor changes, though some details, such as the loss coefficients, are still ambiguous.
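For reference, here is the objective I’ve been assuming: the same triple loss as DistilBERT (soft-target cross-entropy + MLM + cosine embedding), with the coefficient values taken from the DistilBERT distillation example — these coefficients are exactly the part I’d like to confirm for DistilRoBERTa:

```python
def combined_distillation_loss(loss_ce, loss_mlm, loss_cos,
                               alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    """Weighted sum of the three distillation losses I'm assuming:
    soft-target cross-entropy against the teacher, standard MLM loss,
    and cosine-embedding loss between hidden states.
    The default alphas are the DistilBERT example values, not confirmed
    for DistilRoBERTa.
    """
    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```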
Would it be possible to share the exact command/configuration to train DistilRoBERTa?
I’ve been able to replicate DistilRoBERTa to a similar evaluation MLM perplexity, but there still seems to be a small yet statistically significant difference. I can share my full config if it’s helpful.
Thank you!