Albert MLM is slow

I’ve been using Albert for a masked language modeling (grammar correction) task, and it generally works well. However, I’ve found the various Albert models to be slow compared to Bert. I understand that Albert’s tokenizer is less efficient and that its repeated (parameter-shared) layers mean the gain over Bert lies mostly in memory rather than computation, but I’m not sure these factors alone explain the differences I’m seeing.
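
One thing I checked is each model’s width and depth, since sharing parameters across layers doesn’t reduce the number of layer passes at inference time. Here is a quick sketch to print those values from the configs (uses Hugging Face’s AutoConfig; the model names match my timings below):

```python
# Sketch: compare width/depth of the models I'm timing (AutoConfig only, no weights loaded)
from transformers import AutoConfig

for name in ["bert-base-uncased", "albert-base-v2", "albert-large-v2", "albert-xlarge-v2"]:
    cfg = AutoConfig.from_pretrained(name)
    print(f"{name}: hidden_size={cfg.hidden_size}, num_hidden_layers={cfg.num_hidden_layers}")
```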

Applying a fill-mask pipeline to the sentence “The capital of Germany is [MASK]” gives me the following timings on a local machine (a rough sketch of how I measure follows the list):

  • bert-base-uncased: 0.024 seconds
  • albert-base-v2: 0.019 seconds (similar to Bert)
  • albert-large-v2: 0.066 seconds (~3x Bert)
  • albert-xlarge-v2: 0.275 seconds (>11x Bert!)
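
Roughly, I time each model like this (a minimal sketch assuming transformers is installed; the warm-up call and formatting are illustrative rather than my exact script):

```python
import time
from transformers import pipeline

MODELS = ["bert-base-uncased", "albert-base-v2", "albert-large-v2", "albert-xlarge-v2"]

for name in MODELS:
    fill_mask = pipeline("fill-mask", model=name)
    sentence = f"The capital of Germany is {fill_mask.tokenizer.mask_token}."
    fill_mask(sentence)  # warm-up call so one-time setup costs aren't counted
    start = time.perf_counter()
    fill_mask(sentence)
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```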

Can anyone explain this behavior?