Is it just me, or is MobileBERT much slower than DistilBERT on Hugging Face? When I train/fine-tune MobileBERT on a GTX 1070, I get 3.8 it/s. However, when I train DistilBERT on the same GPU, I get 15 it/s. Am I missing something? The MobileBERT paper states that it should be faster.
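If anyone wants to sanity-check the it/s numbers outside of `Trainer`, here is a rough timing sketch. The helper name, batch shape, and step count are my own assumptions, and it only measures forward + backward + optimizer step, not data loading:

```python
import time
import torch

def train_iters_per_sec(model, input_ids, labels, n_steps=20):
    """Rough training-speed probe: time forward + backward + optimizer steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    # One warm-up step so one-time setup cost isn't included in the timing.
    model(input_ids=input_ids, labels=labels).loss.backward()
    optimizer.zero_grad()
    if input_ids.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_steps):
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if input_ids.is_cuda:
        torch.cuda.synchronize()  # flush queued GPU work before stopping the clock
    return n_steps / (time.perf_counter() - start)

# Example usage (downloads checkpoints from the Hub; run once per model):
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained(
#     "google/mobilebert-uncased").cuda()   # or "distilbert-base-uncased"
# ids = torch.randint(1000, 2000, (16, 128), device="cuda")
# labels = torch.randint(0, 2, (16,), device="cuda")
# print(f"{train_iters_per_sec(model, ids, labels):.1f} it/s")
```

Comparing both checkpoints with identical batch shape should show whether the gap is really in the model or somewhere else in the pipeline.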
I have exactly the same issue. Too slow. And in my case, it doesn't converge at all. I increased the learning rate as advised in the paper, but it didn't help. Any ideas?
My code works perfectly with regular BERT and DistilBERT, but performance is very poor with MobileBERT.
Mine does converge, but training takes longer than with DistilBERT. I keep the learning rate at 2e-5. Have you tried training without an LR scheduler?
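For comparison, this is what a constant-LR loop looks like with the scheduler dropped entirely. The tiny linear model and random batches below are just stand-ins for an actual MobileBERT fine-tuning setup:

```python
import torch

# Constant-LR fine-tuning loop: AdamW fixed at 2e-5, no scheduler at all.
# The model and data here are placeholders for illustration only.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(5)]
for features, labels in batches:
    loss = torch.nn.functional.cross_entropy(model(features), labels)
    loss.backward()
    optimizer.step()       # note: no scheduler.step() anywhere
    optimizer.zero_grad()

print(f"lr after training: {optimizer.param_groups[0]['lr']}")
```

With no scheduler, the learning rate stays at 2e-5 for the whole run, which rules out LR decay as the reason it stops converging.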