I have the following script:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import time
# Get free TF32 performance increase if the GPU supports it
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# https://huggingface.co/distil-whisper/distil-large-v2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# https://huggingface.co/docs/transformers/perf_train_gpu_one
model_id = 'distil-whisper/distil-large-v2'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True,
    use_safetensors=True, cache_dir='data', attn_implementation='flash_attention_2',
).to(device)
processor = AutoProcessor.from_pretrained(model_id, cache_dir='data')
# Settings for audio less than 30 seconds
pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset('lj_speech', 'main', split='train', cache_dir='data')
sample = dataset[0]['audio']
start_time = time.time()
result = pipe(sample)
print('--- %s seconds ---' % (time.time() - start_time))
print(result['text'].strip())
It produces the following output:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--- 0.6019861698150635 seconds ---
Printing, in the only sense with which we are at present concerned, differs from most, if not from all, the arts and crafts represented in the exhibition.
Side Quest: If the first warning message could be fixed, that’d be great!
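One thought I had, though I haven’t tested it, is to let from_pretrained place the weights on the GPU at load time via device_map (which I believe requires the accelerate package) instead of calling .to(device) afterwards, so that Flash Attention 2 never sees a CPU-initialized model:

# Untested idea: initialize the weights directly on the GPU so the
# Flash Attention 2 "not initialized on GPU" warning never triggers.
# Using device_map here is my assumption, not something I've confirmed
# works with this exact model.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    cache_dir='data',
    attn_implementation='flash_attention_2',
    device_map=device,  # e.g. 'cuda', replacing the later .to(device)
)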
Next, I’d like to be able to pass a custom prompt to the pipeline and apply a custom vocabulary without having to completely retrain or fine-tune the model. Whisper’s own transcribe method does the following:
all_tokens = []
all_segments = []
prompt_reset_since = 0
if initial_prompt is not None:
    initial_prompt_tokens = tokenizer.encode(" " + initial_prompt.strip())
    all_tokens.extend(initial_prompt_tokens)
else:
    initial_prompt_tokens = []
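On the Transformers side, my (unverified) understanding is that the Whisper processor exposes get_prompt_ids and that the pipeline forwards generate_kwargs on to model.generate, so I was hoping something along these lines would do the same job:

# Unverified sketch: feed an initial prompt through the pipeline.
# The prompt text is just a made-up example of domain terms.
prompt = 'printing, exhibition, typography'
prompt_ids = processor.get_prompt_ids(prompt, return_tensors='pt')
result = pipe(sample, generate_kwargs={'prompt_ids': prompt_ids})
print(result['text'].strip())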
For the custom vocabulary side, since Whisper’s approach just runs the prompt text through the tokenizer, I naturally thought the solution would be to tinker with the pipeline’s tokenizer and add my own words as new tokens. I tried the example given in the documentation, like so:
processor.tokenizer.add_tokens(['my', 'terms'])
model.resize_token_embeddings(len(processor.tokenizer))
But it doesn’t work. What am I missing?