I am trying to generate text from audio using Whisper via Hugging Face. I want to pass previously generated text into the model as context, without it being regenerated in the output. My understanding was that past_key_values could do this, but when I pass it in as an argument, the generated text is nothing but what was already generated in past_key_values, and the transcription of the new audio input never appears.
Here is a fully reproducible example:
>>> import torch
>>> from transformers import AutoProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset
>>>
>>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>>
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>>
>>> inputs = processor(ds[0]["audio"]["array"], return_tensors="pt")
It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
>>> input_features = inputs.input_features
>>> generated_ids = model.generate(inputs=input_features, return_dict_in_generate=True)
>>> processor.batch_decode(generated_ids.sequences, skip_special_tokens=True)[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
>>> past = generated_ids.past_key_values
>>> inputs = processor(ds[1]["audio"]["array"], return_tensors="pt")
It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
>>> input_features = inputs.input_features
>>> generated_ids = model.generate(inputs=input_features, past_key_values=past)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
>>> generated_ids = model.generate(inputs=input_features)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
" Nor is Mr. Quilter's manner less interesting than his matter."
You can see from the last few lines:
>>> generated_ids = model.generate(inputs=input_features, past_key_values=past)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
>>> generated_ids = model.generate(inputs=input_features)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
" Nor is Mr. Quilter's manner less interesting than his matter."
that the output for the same input depends entirely on what I pass into past_key_values; the correct transcription is the one on the last line. Perhaps I am misunderstanding the use case, but is there a reason it does not seem able to generate the transcription of the new audio?