I am running the "original" (openai-whisper) and HF versions of the Whisper medium model and get inconsistent output. In particular, there are a few utterances on which the HF medium model generates a lot of hallucinations but the PyTorch version does not:
Ref: BOOP BOOP BOOP BOOP BOOP BOOP
PyTorch: GO GO GO GO GO GO GO GO
HF: HYP: GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO
The speech is not great quality (it is a mother interacting with a baby; regrettably I cannot share the waveform).
This happens on a few utterances and significantly inflates the WER because of the massive number of insertions.
Here is the main code loop I am using for the non-HF model (args.language = "en"):
if args.language != "":
    options = dict(language=args.language)
    transcribe_options = dict(task="transcribe", **options)

for index, row in test_df.iterrows():
    if (index % args.batches) != args.offset:
        continue
    if args.language != "":
        pred_transcription = whisper.transcribe(model, row['path'], **transcribe_options)
    else:
        pred_transcription = whisper.transcribe(model, row['path'])
    for segment in pred_transcription['segments']:
        del segment['tokens']
        del segment['seek']
        del segment['temperature']
        del segment['compression_ratio']
        segment['filename'] = row['path']
        print(segment)
        if args.results is not None:
            print(segment, file=f)
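For reference, from reading the openai-whisper source, transcribe() already applies several anti-hallucination safeguards by default (temperature fallback, a compression-ratio threshold, a log-probability threshold), which a bare HF generate() call does not replicate. A minimal sketch making those documented defaults explicit:

# Sketch: the safeguards whisper.transcribe() applies by default.
# Spelling them out makes clear what the HF generate() call below lacks.
pred_transcription = whisper.transcribe(
    model,
    row['path'],
    task="transcribe",
    language=args.language,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # retry at higher temperature on failure
    compression_ratio_threshold=2.4,  # treat highly repetitive output as a failed decode
    logprob_threshold=-1.0,           # treat low-confidence output as a failed decode
    no_speech_threshold=0.6,          # skip segments that look like silence
    condition_on_previous_text=True,  # False stops repetition loops from propagating
)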
and here is the main loop for the HF version (args.whisper is set to "openai/whisper-medium"):
tokenizer = WhisperTokenizer.from_pretrained(args.whisper, language="English")
feature_extractor = WhisperFeatureExtractor.from_pretrained(args.whisper)
processor = WhisperProcessor.from_pretrained(args.whisper, language="English")
model = WhisperForConditionalGeneration.from_pretrained(args.whisper)
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
forced_decoder_ids = processor.get_decoder_prompt_ids(language="English", task="transcribe")
def map_to_result(batch):
    with torch.no_grad():
        input_features = processor(
            batch["audio"]["array"],
            sampling_rate=batch["audio"]["sampling_rate"],
            return_tensors="pt",
        ).input_features
        predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    batch["pred_str"] = re.sub(chars_to_ignore_regex, "", transcription[0]).upper() + " "
    batch["ref"] = batch["text"]
    return batch
results = test_dataset.map(map_to_result, remove_columns=test_dataset.column_names)
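On the HF side, the only knobs I have found so far are the generic transformers generation controls. A sketch of what I am considering trying, replacing the generate() line in map_to_result above (the values here are guesses, not tuned):

# Sketch: cap output length and penalize n-gram repeats via the generic
# transformers generation arguments (values are untuned guesses).
predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids,
    max_new_tokens=128,       # hard cap on output length per 30 s window
    no_repeat_ngram_size=4,   # forbid exact 4-gram repeats (risky when the reference
                              # genuinely repeats, e.g. "BOOP BOOP BOOP")
    repetition_penalty=1.2,   # mildly discourage repeated tokens
)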
Does anyone have any insights into why the HF version tends to hallucinate significantly more than the non-HF version, and whether there is a quick fix for the hallucination problem? (I do see suggestions out there, such as the generation arguments sketched above, but am not sure exactly which parameters to set where.)
Thanks
Michael