Performing Whisper's "transcribe" with Transformer pipelines

I have the following script:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import time

# Get free TF32 performance increase if the GPU supports it
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# https://huggingface.co/distil-whisper/distil-large-v2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# https://huggingface.co/docs/transformers/perf_train_gpu_one
model_id = 'distil-whisper/distil-large-v2'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    cache_dir='data',
    attn_implementation='flash_attention_2',
).to(device)
processor = AutoProcessor.from_pretrained(model_id, cache_dir='data')

# Settings for audio less than 30 seconds
pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset('lj_speech', 'main', split='train', cache_dir='data')
sample = dataset[0]['audio']

start_time = time.time()
result = pipe(sample)
print('--- %s seconds ---' % (time.time() - start_time))
print(result['text'].strip())

Producing the following output:

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--- 0.6019861698150635 seconds ---
Printing, in the only sense with which we are at present concerned, differs from most, if not from all, the arts and crafts represented in the exhibition.

Side Quest: If the first warning message could be fixed that’d be great!

Next I’d like to pass a custom prompt to the pipeline and apply a custom vocabulary, without completely retraining or fine-tuning the model. OpenAI Whisper’s transcribe method does the following:

all_tokens = []
all_segments = []
prompt_reset_since = 0

if initial_prompt is not None:
    initial_prompt_tokens = tokenizer.encode(" " + initial_prompt.strip())
    all_tokens.extend(initial_prompt_tokens)
else:
    initial_prompt_tokens = []

So naturally I thought the solution would be to tinker with the pipeline’s tokenizer and add my own words as tokens. I tried the example given in the documentation, like so:

processor.tokenizer.add_tokens(['my', 'terms'])
model.resize_token_embeddings(len(processor.tokenizer))

But it doesn’t work. What am I missing?

For your side quest: add torch.set_default_device(device) before loading the model. The message is harmless anyway, since you move the model to the device right after creating it.
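
A minimal sketch of what I mean, assuming PyTorch 2.0+ (where torch.set_default_device exists):

import torch
from transformers import AutoModelForSpeechSeq2Seq

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# New tensors (including the model's weights) are now created on the GPU,
# so from_pretrained no longer sees a CPU-initialized model.
torch.set_default_device(device)

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    'distil-whisper/distil-large-v2',
    torch_dtype=torch.float16,
    attn_implementation='flash_attention_2',
)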

For additional tokens, I believe you have to train or fine-tune: resizing the token embeddings adds randomly initialized rows for the new tokens, so the model has no learned weights for them.

Thanks! I discovered the device_map model parameter, which fixes the warning, so that side quest is done! As for the other recommendations, I’d already found that info during my own research. I’m trying to do something faster than just calling the transcribe method, which is what spurred this whole line of discovery.
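
For reference, the change looks roughly like this (device_map needs the accelerate package installed; I’m passing the device string directly, though 'auto' also works):

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    cache_dir='data',
    attn_implementation='flash_attention_2',
    device_map=device,  # weights are placed on the GPU during loading, so no .to(device) afterwards
)

I also dropped device=device from the pipeline call, since a model loaded with device_map is already placed and the pipeline will complain if you try to move it again.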

I’ve tried this technique:

# new tokens
new_tokens = ['my', 'terms']

# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())

# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))

# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer))

I ran that after initializing the tokenizer and before creating the pipeline, and it didn’t work.

If I wanted to do an OpenAI-esque attempt using an initial_prompt, what would that look like with pipelines?
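
From the docs, I’m guessing the pipeline equivalent uses WhisperProcessor.get_prompt_ids (available in recent transformers releases) plus generate_kwargs, something like the sketch below, though I haven’t verified it against distil-whisper:

# Untested sketch: encode an initial prompt and forward it to model.generate()
prompt_ids = processor.get_prompt_ids('my terms', return_tensors='pt').to(device)

result = pipe(sample, generate_kwargs={'prompt_ids': prompt_ids})
print(result['text'].strip())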

Another solution I’m looking into is a second, text-to-text pipeline that takes the output of the first one and uses another model to spell-check the transcript or insert my custom terms where they apply.
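
As a rough illustration of that idea (the model choice and the prompt here are placeholders, not anything I’ve settled on):

from transformers import pipeline

# Placeholder post-processing step: ask a text-to-text model to repair
# domain-specific terms in the ASR output.
corrector = pipeline('text2text-generation', model='google/flan-t5-base')

transcript = result['text'].strip()
fixed = corrector(
    f'Fix any misspelled technical terms in this transcript: {transcript}',
    max_new_tokens=256,
)[0]['generated_text']
print(fixed)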