I am trying to generate text from audio using Whisper via Hugging Face. I want to pass previously generated text into the model as context, without it being regenerated in the output. My understanding was that past_key_values could do this, but when I pass it in as an argument, the generated text is nothing but what was already generated in past_key_values, and the transcription of the new audio input never appears.
Here is a fully reproducible example:
>>> import torch
>>> from transformers import AutoProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset
>>>
>>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>>
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>>
>>> inputs = processor(ds[0]["audio"]["array"], return_tensors="pt")
It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
>>> input_features = inputs.input_features
>>> generated_ids = model.generate(inputs=input_features, return_dict_in_generate=True)
>>> processor.batch_decode(generated_ids.sequences, skip_special_tokens=True)[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
>>> past = generated_ids.past_key_values
>>> inputs = processor(ds[1]["audio"]["array"], return_tensors="pt")
It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
>>> input_features = inputs.input_features
>>> generated_ids = model.generate(inputs=input_features, past_key_values=past)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
>>> generated_ids = model.generate(inputs=input_features)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
" Nor is Mr. Quilter's manner less interesting than his matter."
You can see from the last few lines:
>>> generated_ids = model.generate(inputs=input_features, past_key_values=past)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
>>> generated_ids = model.generate(inputs=input_features)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
" Nor is Mr. Quilter's manner less interesting than his matter."
that the output for the same input depends entirely on what I pass into past_key_values; the correct transcription is the one on the last line. Perhaps I am misunderstanding the use case, but is there a reason it does not seem able to generate the transcription of the new audio?