T5 GPU Runtime Degradation

Hello,

I've noticed that the running time of T5 on a GPU has increased between v3.4.0 and the current version (v4.2.1). When running inference on a K80 GPU (Google Colab), the average runtime of a generate() call on a single example (the one from the transformers documentation) with t5-base is 539 ± 13 ms on v3.4.0, versus 627 ± 13 ms on v4.2.1.
On t5-large, the runtime is 1004 ± 22 ms on v3.4.0, compared to 1242 ± 15 ms on v4.2.1. That works out to roughly a 16% slowdown for t5-base and a 24% slowdown for t5-large.

I made two Colab notebooks that compare the two versions:
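For context, the only intended difference between the two notebooks is which release gets installed before the benchmark runs; the setup cell in each is along these lines (shown here as Colab cells):

!pip install transformers==3.4.0  # first notebook
!pip install transformers==4.2.1  # second notebook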

I'm aware of at least one bug fix that was made to the attention mechanism of T5 in v4.0.0 (#8158), but I don't think that change should have caused such a degradation.
Any idea why such a degradation occurred?

Thanks!

Code snippet just in case:

import torch
import time
import numpy as np
from transformers import T5TokenizerFast, T5ForConditionalGeneration
from transformers import __version__ as transformers_version
from torch import __version__ as torch_version

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device: {device}")
t5_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_model = t5_model.to(device)
t5_input_ids = t5_tokenizer("summarize: studies have shown that owning a dog is good for you ", return_tensors="pt").input_ids  # Batch size 1
t5_input_ids = t5_input_ids.to(device)

N = 100  # number of timed generate() calls
times = []
for _ in range(N):
    start = time.time()
    t5_outputs = t5_model.generate(t5_input_ids)
    end = time.time()
    times.append(end - start)
print(f"transformers version: {transformers_version}ā€œ)
print(f"torch version: {torch_version}ā€)
print(f"{1000np.mean(times):.0f} ms \u00B1 {1000np.std(times):.2f} ms per loop (mean \u00B1 std of {N} runs)")