Hello,
I've noticed that the running time of T5 on a GPU has increased between v3.4.0 and the current version (v4.2.1). When running inference on a single example (the one from the transformers documentation) on a K80 GPU (Google Colab), the average runtime of a generate() call with t5-base is 539 ± 13 ms in v3.4.0, versus 627 ± 13 ms in v4.2.1.
With t5-large, the runtimes are 1004 ± 22 ms (v3.4.0) versus 1242 ± 15 ms (v4.2.1).
I made two Colab notebooks that compare the two versions:
I'm aware of at least one bug fix that was made to the attention mechanism of T5 in v4.0.0 (#8158), but I don't think this change should have caused such a degradation.
Any idea why such a degradation occurred?
Thanks!
Code snippet just in case:
import torch
import time
import numpy as np
from transformers import T5TokenizerFast, T5ForConditionalGeneration
from transformers import __version__ as transformers_version
from torch import __version__ as torch_version

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

t5_tokenizer = T5TokenizerFast.from_pretrained("t5-base")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-base")
t5_model = t5_model.to(device)

t5_input_ids = t5_tokenizer("summarize: studies have shown that owning a dog is good for you ", return_tensors="pt").input_ids  # Batch size 1
t5_input_ids = t5_input_ids.to(device)

# Time N generate() calls on the single example
N = 100
times = []
for _ in range(N):
    start = time.time()
    t5_outputs = t5_model.generate(t5_input_ids)
    end = time.time()
    times.append(end - start)

print(f"transformers version: {transformers_version}")
print(f"torch version: {torch_version}")
print(f"{1000 * np.mean(times):.0f} ms \u00B1 {1000 * np.std(times):.2f} ms per loop (mean \u00B1 std of {N} runs)")