Inconsistent output between PyTorch and HF whisper medium models

I am running the "original" (PyTorch) and HF versions of the Whisper medium model and getting inconsistent output. In particular, there are a few utterances for which the HF medium model generates a lot of "hallucinations" but the PyTorch version does not:

Ref: BOOP BOOP BOOP BOOP BOOP BOOP
PyTorch: GO GO GO GO GO GO GO GO
HF: HYP: GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO GO

The speech is not great quality (it is a mother interacting with a baby; regrettably, I cannot share the waveform).

This happens on a few utterances and significantly changes the WER because of the massive number of insertions.
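
To make the effect concrete, here is a rough back-of-the-envelope for the utterance above (I have not tallied the exact number of inserted words, so the counts are approximate):

# rough illustration of how insertions blow up WER for one utterance
ref_words = 6        # "BOOP BOOP BOOP BOOP BOOP BOOP"
substitutions = 6    # every reference word is replaced by "GO"
deletions = 0
insertions = 200     # order of magnitude of the extra "GO" tokens in the HF output

wer = (substitutions + deletions + insertions) / ref_words
print(wer)           # ~34, i.e. over 3000% WER for this single utterance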

Here is the main code loop I am using for the non-HF model (args.language = "en"):

import whisper  # openai-whisper package

# model is loaded earlier with whisper.load_model("medium")

if args.language != "":
    options = dict(language=args.language)
    transcribe_options = dict(task="transcribe", **options)

for index, row in test_df.iterrows():

    # simple sharding: only process the rows assigned to this worker
    if (index % args.batches) != args.offset:
        continue
    if args.language != "":
        pred_transcription = whisper.transcribe(model, row['path'], **transcribe_options)
    else:
        pred_transcription = whisper.transcribe(model, row['path'])

    for segment in pred_transcription['segments']:
        # drop fields we do not need in the output
        del segment['tokens']
        del segment['seek']
        del segment['temperature']
        del segment['compression_ratio']
        segment['filename'] = row['path']
        print(segment)
        if args.results is not None:
            print(segment, file=f)  # f is a results file opened earlier when args.results is set
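
For completeness: I am relying on the default decoding behaviour of whisper.transcribe above. If I spelled the options out explicitly, my understanding is it would look roughly like this (the temperature fallback and the compression-ratio/logprob checks are, as far as I know, the defaults in the openai-whisper package):

# my understanding of the defaults that whisper.transcribe applies
pred_transcription = whisper.transcribe(
    model,
    row['path'],
    task="transcribe",
    language=args.language,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule on decode failure
    compression_ratio_threshold=2.4,             # retry hotter if the output looks too repetitive
    logprob_threshold=-1.0,                      # retry if the average token logprob is too low
    no_speech_threshold=0.6,
    condition_on_previous_text=True,
)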

and here is the main loop for the HF version:

args.whisper is set to "openai/whisper-medium".

import re

import torch
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

tokenizer = WhisperTokenizer.from_pretrained(args.whisper, language="English")
feature_extractor = WhisperFeatureExtractor.from_pretrained(args.whisper)
processor = WhisperProcessor.from_pretrained(args.whisper, language="English")

model = WhisperForConditionalGeneration.from_pretrained(args.whisper)

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

forced_decoder_ids = processor.get_decoder_prompt_ids(language="English", task="transcribe")

# chars_to_ignore_regex is defined elsewhere (used to strip punctuation before scoring)

def map_to_result(batch):
    with torch.no_grad():
        input_features = processor(batch["audio"]["array"],
                                   sampling_rate=batch["audio"]["sampling_rate"],
                                   return_tensors="pt").input_features
        predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    batch["pred_str"] = re.sub(chars_to_ignore_regex, '', transcription[0]).upper() + " "
    batch["ref"] = batch["text"]
    return batch

results = test_dataset.map(map_to_result, remove_columns=test_dataset.column_names)

Does anyone have any insight into why the HF version tends to hallucinate significantly more than the non-HF version, and whether there is a quick fix for the hallucination problem? I do see suggestions out there (something like the sketch below), but I am not sure exactly which parameters to set, or where.
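
For reference, the kind of thing I have seen suggested looks roughly like this. As far as I know, repetition_penalty, no_repeat_ngram_size and max_new_tokens are generic transformers generate() arguments rather than anything Whisper-specific, and the values below are just guesses on my part, not something I have validated:

# untested sketch; values are guesses, not recommendations
predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids,
    no_repeat_ngram_size=3,     # block exact 3-gram repeats such as "GO GO GO"
    repetition_penalty=1.2,     # penalise already-generated tokens
    max_new_tokens=225,         # cap output length well below Whisper's 448-token limit
)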

Thanks
Michael