I’ve been using Albert for a masked language modeling (grammar correction) task, and it generally works well. However, I found the various Albert models to be noticeably slower than Bert. I understand that Albert’s tokenizer is less efficient and that its repeated (parameter-shared) layers mean the gain over Bert lies mostly in memory rather than computation, but I wonder whether these factors alone can explain the differences I’m seeing.
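For reference, here is a quick sketch comparing parameter counts (the memory side) with the settings that drive per-step compute (layer count and hidden size) for the models I benchmark below. Using AutoConfig/AutoModelForMaskedLM is just one way to pull these numbers, and the checkpoints get downloaded on first run:

```python
# Rough comparison: parameter count (memory) vs. depth/width (compute).
# Assumes transformers and torch are installed.
from transformers import AutoConfig, AutoModelForMaskedLM

for name in ["bert-base-uncased", "albert-base-v2", "albert-large-v2", "albert-xlarge-v2"]:
    cfg = AutoConfig.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(
        f"{name}: {n_params / 1e6:.0f}M params, "
        f"{cfg.num_hidden_layers} layers, hidden size {cfg.hidden_size}"
    )
```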
Applying a fill-mask pipeline on the sentence “The capital of Germany is [MASK]” gives me the following results (on a local machine):
- bert-base-uncased: 0.024 seconds
- albert-base-v2: 0.019 seconds (similar to Bert)
- albert-large-v2: 0.066 seconds (~3x Bert)
- albert-xlarge-v2: 0.275 seconds (>11x Bert!)
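A minimal sketch of how these timings can be reproduced (the warm-up call and averaging over 10 runs are my own choices; the mask token is taken from each model’s tokenizer so the same snippet works for both Bert and Albert):

```python
# Minimal timing sketch for the fill-mask pipeline on a single masked sentence.
import time

from transformers import pipeline

MODELS = ["bert-base-uncased", "albert-base-v2", "albert-large-v2", "albert-xlarge-v2"]

for name in MODELS:
    fill_mask = pipeline("fill-mask", model=name)
    sentence = f"The capital of Germany is {fill_mask.tokenizer.mask_token}."
    fill_mask(sentence)  # warm-up call so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(10):
        fill_mask(sentence)
    print(f"{name}: {(time.perf_counter() - start) / 10:.4f} s per call")
```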
Can anyone explain this behavior?