Inference on a sample file takes about 5x longer when whisper-large-v3 is loaded in 8-bit mode on an NVIDIA T4 GPU.
Shouldn't quantization improve inference speed on the GPU?
https://pytorch.org/docs/stable/quantization.html
Also, GPU utilization sits at only about 33% in nvidia-smi while generation runs.
Here is the code I'm running:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers.pipelines.audio_classification import ffmpeg_read

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
feature_extractor = processor.feature_extractor
tokenizer = processor.tokenizer

model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    device_map="auto",
    load_in_8bit=True,  # requires bitsandbytes + accelerate
)

sample = "sample.mp3"  # 27s long

with torch.inference_mode():
    with open(sample, "rb") as f:
        inputs = f.read()
    # decode the mp3 bytes to a float waveform at Whisper's 16 kHz sampling rate
    inputs = ffmpeg_read(inputs, feature_extractor.sampling_rate)
    input_features = feature_extractor(
        inputs, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
    )["input_features"]
    # already a tensor; move it to the GPU rather than re-wrapping with torch.tensor
    input_features = input_features.to("cuda", dtype=torch.float16)
    generated_ids = model_8bit.generate(input_features=input_features, return_timestamps=False)
    out = tokenizer.decode(generated_ids.squeeze(), skip_special_tokens=True)
    print(out)
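
For reference, this is a minimal sketch of how the 8-bit model can be timed against an fp16 baseline on the same input (model_fp16 and time_generate are names introduced here for illustration; model_8bit and input_features are reused from the code above):

import time
import torch
from transformers import AutoModelForSpeechSeq2Seq

# fp16 baseline (no quantization) for comparison
model_fp16 = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
).to("cuda")

def time_generate(model, input_features, n_runs=3):
    # synchronize around the timed region so CUDA's async execution
    # doesn't make generate() appear faster than it is
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(input_features=input_features)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

print("fp16 :", time_generate(model_fp16, input_features))
print("8-bit:", time_generate(model_8bit, input_features))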
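
On the utilization point: rather than eyeballing nvidia-smi, utilization can be sampled while generate() runs. A minimal sketch, assuming pynvml is installed (torch.cuda.utilization() wraps NVML); sample_utilization is an illustrative helper, and model_8bit / input_features again come from the code above:

import threading
import time
import torch

samples = []

def sample_utilization(stop_event, interval=0.25):
    # poll NVML's GPU utilization counter until asked to stop
    while not stop_event.is_set():
        samples.append(torch.cuda.utilization())
        time.sleep(interval)

stop = threading.Event()
t = threading.Thread(target=sample_utilization, args=(stop,))
t.start()
model_8bit.generate(input_features=input_features)
stop.set()
t.join()
print(f"mean GPU utilization: {sum(samples) / len(samples):.0f}%")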