T5 GPU Runtime Degradation


I’ve noticed that T5 inference on a GPU has become slower between v3.4.0 and the current version (v4.2.1). On a K80 GPU (Google Colab), the average runtime of a generate() call on a single example (the one in the transformers documentation) with t5-base is 539 ± 13 ms in v3.4.0, versus 627 ± 13 ms in v4.2.1.
With t5-large, the runtime is 1004 ± 22 ms in v3.4.0, versus 1242 ± 15 ms in v4.2.1.
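To put those numbers in relative terms, here is a minimal sketch that computes the slowdown in percent. The helper name relative_slowdown is made up for illustration, and it assumes the first t5-large figure is the v3.4.0 one, matching the order of the t5-base figures. It works out to roughly 16% for t5-base and 24% for t5-large.

```python
def relative_slowdown(old_ms: float, new_ms: float) -> float:
    """Return how much slower new_ms is than old_ms, in percent."""
    return (new_ms - old_ms) / old_ms * 100.0


# Mean generate() latencies from the post, in milliseconds: (v3.4.0, v4.2.1).
# Assumption: for t5-large the first number is v3.4.0, as for t5-base.
reported = {
    "t5-base": (539.0, 627.0),
    "t5-large": (1004.0, 1242.0),
}

for name, (old_ms, new_ms) in reported.items():
    print(f"{name}: {relative_slowdown(old_ms, new_ms):.1f}% slower")
```

In both cases the gap is several times larger than the reported standard deviations, so it is unlikely to be run-to-run noise.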

I made two colab notebooks that compare the two versions:

I’m aware of at least one bug fix that was made to the attention mechanism of T5 in v4.0.0 (#8158), but I don’t think that change should have caused such a degradation.
Any idea why such a degradation occurred?


Code snippet just in case:

import time

import numpy as np
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration
from transformers import __version__ as transformers_version
from torch import __version__ as torch_version

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device: {device}")

# Load the tokenizer and model, then move the model to the device.
t5_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_model = t5_model.to(device)

# Tokenize the single example from the documentation (batch size 1).
t5_input_ids = t5_tokenizer("summarize: studies have shown that owning a dog is good for you ", return_tensors="pt").input_ids
t5_input_ids = t5_input_ids.to(device)

# Time N generate() calls and record each duration in seconds.
N = 100
times = []
for _ in range(N):
    start = time.time()
    t5_outputs = t5_model.generate(t5_input_ids)
    end = time.time()
    times.append(end - start)

print(f"transformers version: {transformers_version}")
print(f"torch version: {torch_version}")
print(f"{1000 * np.mean(times):.0f} ms \u00B1 {1000 * np.std(times):.2f} ms per loop (mean \u00B1 std of {N} runs)")
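In case it helps anyone reproduce this, below is a slightly more careful timing loop. It is a sketch, not what produced the numbers above. The names benchmark and summarize are made up for illustration. It adds a few warmup calls, uses time.perf_counter(), and accepts an optional sync callable so that pending GPU work can be flushed before each clock read. Whether any of this changes the gap between the two versions is an open question.

```python
import statistics
import time
from typing import Callable, List, Optional, Tuple


def benchmark(fn: Callable[[], object], n: int = 100, warmup: int = 5,
              sync: Optional[Callable[[], None]] = None) -> List[float]:
    """Time n calls of fn() and return the durations in seconds.

    sync, if given, is called before and after each timed call so that
    asynchronous work (e.g. queued CUDA kernels) is included in the timing.
    """
    # Warmup calls are not timed; they absorb one-off setup costs.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    times = []
    for _ in range(n):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        times.append(time.perf_counter() - start)
    return times


def summarize(times: List[float]) -> Tuple[float, float]:
    """Return (mean, population std) in milliseconds, like np.mean / np.std."""
    return 1000 * statistics.mean(times), 1000 * statistics.pstdev(times)
```

With the snippet above it could be called as benchmark(lambda: t5_model.generate(t5_input_ids), n=N, sync=torch.cuda.synchronize) on a GPU, or without sync on a CPU, and the result passed to summarize().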