Performing Whisper's "transcribe" with Transformer pipelines

I have the following script:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import time

# Get free TF32 performance increase if the GPU supports it
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# https://huggingface.co/distil-whisper/distil-large-v2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# https://huggingface.co/docs/transformers/perf_train_gpu_one
model_id = 'distil-whisper/distil-large-v2'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    cache_dir='data',
    attn_implementation='flash_attention_2',
).to(device)
processor = AutoProcessor.from_pretrained(model_id, cache_dir='data')

# Settings for audio less than 30 seconds
pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset('lj_speech', 'main', split='train', cache_dir='data')
sample = dataset[0]['audio']

start_time = time.time()
result = pipe(sample)
print('--- %s seconds ---' % (time.time() - start_time))
print(result['text'].strip())

Producing the following output:

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--- 0.6019861698150635 seconds ---
Printing, in the only sense with which we are at present concerned, differs from most, if not from all, the arts and crafts represented in the exhibition.

Side Quest: If the first warning message could be fixed that’d be great!

Next I’d like to pass a custom prompt to the pipeline and apply a custom vocabulary, without completely retraining or fine-tuning the model. OpenAI Whisper’s transcribe method does the following:

all_tokens = []
all_segments = []
prompt_reset_since = 0

if initial_prompt is not None:
    initial_prompt_tokens = tokenizer.encode(" " + initial_prompt.strip())
    all_tokens.extend(initial_prompt_tokens)
else:
    initial_prompt_tokens = []

So naturally I thought the solution would be to tinker with the pipeline’s tokenizer and add my own words as tokens. I tried the example given in the documentation, like so:

processor.tokenizer.add_tokens(['my', 'terms'])
model.resize_token_embeddings(len(processor.tokenizer))

But it doesn’t work. What am I missing?

For your side quest: add torch.set_default_device(device) before loading the model. The message is harmless anyway, since you move the model to the device right after creating it.
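
A minimal sketch of what I mean, assuming PyTorch 2.0+ (where torch.set_default_device exists):

import torch
from transformers import AutoModelForSpeechSeq2Seq

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# New tensors (including the model's weights) are now created on the GPU,
# so from_pretrained no longer sees a CPU-initialized model.
torch.set_default_device(device)

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    'distil-whisper/distil-large-v2',
    torch_dtype=torch.float16,
    attn_implementation='flash_attention_2',
)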

For additional tokens, I believe you have to train or fine-tune: resizing the token embeddings adds randomly initialized rows for the new tokens, so the model has no learned weights for them.

Thanks! I discovered the device_map model parameter, which fixes the warning, so that side quest is done! As for the other recommendations, I’d already found that info during my own research. I’m trying to do something faster than just calling the transcribe method, which is what spurred this whole line of discovery.
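
For reference, the change looks roughly like this (device_map needs the accelerate package installed; I’m passing the device string directly, though 'auto' also works):

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    cache_dir='data',
    attn_implementation='flash_attention_2',
    device_map=device,  # weights are placed on the GPU during loading, so no .to(device) afterwards
)

I also dropped device=device from the pipeline call, since a model loaded with device_map is already placed and the pipeline will complain if you try to move it again.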

I’ve tried this technique:

# new tokens
new_tokens = ['my', 'terms']

# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())

# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))

# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer))

I ran that after initializing the tokenizer and before creating the pipeline, and it didn’t work.

If I wanted to do an OpenAI-esque attempt using an initial_prompt, what would that look like with pipelines?
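
From the docs, I’m guessing the pipeline equivalent uses WhisperProcessor.get_prompt_ids (available in recent transformers releases) plus generate_kwargs, something like the sketch below, though I haven’t verified it against distil-whisper:

# Untested sketch: encode an initial prompt and forward it to model.generate()
prompt_ids = processor.get_prompt_ids('my terms', return_tensors='pt').to(device)

result = pipe(sample, generate_kwargs={'prompt_ids': prompt_ids})
print(result['text'].strip())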

Another solution I’m looking into is a second, text-to-text pipeline that takes the output of the first one and uses another model to spell-check the transcript or insert my custom terms where they apply.
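
As a rough illustration of that idea (the model choice and the prompt here are placeholders, not anything I’ve settled on):

from transformers import pipeline

# Placeholder post-processing step: ask a text-to-text model to repair
# domain-specific terms in the ASR output.
corrector = pipeline('text2text-generation', model='google/flan-t5-base')

transcript = result['text'].strip()
fixed = corrector(
    f'Fix any misspelled technical terms in this transcript: {transcript}',
    max_new_tokens=256,
)[0]['generated_text']
print(fixed)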