Confusing benchmark results: running Whisper on RTX 4080 Super vs A10 vs H100

I’m trying to determine the latency I can expect when using the Whisper model to process hundreds of requests simultaneously. Ultimately, I want to ascertain the number of GPUs required to handle 100 simultaneous requests with less than 1 second of latency, assuming each request involves less than 30 seconds of audio.
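For context, this is the back-of-envelope model I have in mind for turning a measured per-batch latency into a GPU count (the batch_size and batch_latency_s values below are placeholders, not measurements):

import math

def gpus_needed(concurrent_requests, batch_size, batch_latency_s, latency_budget_s=1.0):
    # How many batches can one GPU finish sequentially within the latency budget?
    batches_per_gpu = int(latency_budget_s // batch_latency_s)
    if batches_per_gpu == 0:
        return None  # a single batch already exceeds the latency budget
    total_batches = math.ceil(concurrent_requests / batch_size)
    return math.ceil(total_batches / batches_per_gpu)

# Placeholder numbers; batch_latency_s would come from benchmarks like the ones below
print(gpus_needed(concurrent_requests=100, batch_size=16, batch_latency_s=0.5))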

I conducted tests on three different devices and obtained unexpected results. For these tests, I used Hugging Face’s transformers library, which supports Flash Attention 2 and batched inference. I specifically used the distil-whisper/distil-medium.en model to speed up inference. The task was to transcribe 50 long .wav files (30 seconds each) and 50 short .wav files (3 seconds each).

I tested the setup in three different environments:

  • RTX 4080 Super + Windows (local machine): Fastest time was 15 seconds with a batch size of at least 16.
  • Nvidia A10 + Ubuntu (Lambda Labs): Fastest time was 24 seconds with a batch size of at least 8 (run in a Jupyter notebook).
  • Nvidia H100 + Ubuntu (Lambda Labs): Fastest time was 23 seconds with a batch size of at least 64 (run in a Jupyter notebook).

These results are confusing, considering that the computation runs mainly in fp16. The RTX 4080 Super, A10, and H100 have quoted peak fp16 throughput of roughly 52.22 TFLOPS, 125 TFLOPS, and 1513 TFLOPS, respectively, so I expected inference speed to rank H100 > A10 > RTX 4080 Super. Instead, the 4080 Super was fastest and the H100 barely beat the A10. Unless I’ve misunderstood something, these results don’t align with the hardware capabilities.
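To make the mismatch concrete, here is the quick arithmetic behind that expectation, using the quoted peak fp16 numbers and my measured wall-clock times for the 100-file run:

# Quoted peak fp16 TFLOPS vs measured wall-clock seconds for the 100-file run
specs = {"RTX 4080 Super": (52.22, 15), "A10": (125, 24), "H100": (1513, 23)}

base_tflops, base_time = specs["RTX 4080 Super"]
for name, (tflops, seconds) in specs.items():
    expected_speedup = tflops / base_tflops   # if inference were purely compute-bound
    measured_speedup = base_time / seconds    # what I actually observed
    print(f"{name}: expected {expected_speedup:.1f}x, measured {measured_speedup:.2f}x vs the 4080 Super")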

Do the above results make sense, or could someone explain why this might be happening? Below is the code I used for testing.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import time

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("device: " + device)
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-medium.en"

# Load the model in fp16 with Flash Attention 2 enabled
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Assemble the ASR pipeline; the batch size is passed at call time below
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

start_time = time.time()
# 50 copies each of the short (3 s) and long (30 s) test files
filename_list = ['./download.wav', './download-long.wav'] * 50
# Transcribe all 100 files in one batched pipeline call
results = pipe(filename_list, batch_size=16)
end_time = time.time()
print("Time taken start to end (seconds):", end_time - start_time)
print([result["text"] for result in results])
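One thing I’m not sure about is whether these timings are inflated by one-time overhead (CUDA context creation, kernel selection, reading the files from disk on the first pass). This is roughly how I would separate a warm-up pass from the timed run, reusing the pipe object defined above (untested sketch):

# Warm-up pass so one-time CUDA/kernel setup is not counted in the measurement
_ = pipe('./download.wav')
torch.cuda.synchronize()

start_time = time.time()
results = pipe(filename_list, batch_size=16)
torch.cuda.synchronize()  # make sure all GPU work has finished before stopping the clock
end_time = time.time()
print("Time taken excluding warm-up (seconds):", end_time - start_time)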