Pretrained wav2vec2 speech-to-text - decoded text is gibberish

I am trying to do inference with the pretrained wav2vec2-base-960h on a couple of audio files of conversation between English speakers.

The decoded text looks mostly like gibberish, and I am wondering why - maybe I am doing something wrong.

The code I’m using is the most basic inference example from the wav2vec2 GitHub page:

# !pip install transformers
# !pip install datasets
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load pretrained model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

librispeech_samples_ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# load audio from the first dataset sample
audio_input, sample_rate = sf.read(librispeech_samples_ds[0]["file"])

# pad input values and return pt tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values  


# retrieve logits & take argmax
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe
transcription = processor.decode(predicted_ids[0])
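For context, I know the model expects 16 kHz input and that the processor does not resample for me, so I also added a small helper to make sure my own files match that rate before feeding them in. This is just my own sketch (the `ensure_16k` name is mine, and it assumes scipy is installed):

```python
import numpy as np
from scipy.signal import resample_poly

def ensure_16k(audio: np.ndarray, sample_rate: int, target_rate: int = 16000) -> np.ndarray:
    """Resample mono audio to the 16 kHz rate wav2vec2-base-960h was trained on."""
    if sample_rate == target_rate:
        return audio
    # polyphase resampling: upsample by target_rate/g, downsample by sample_rate/g
    g = np.gcd(sample_rate, target_rate)
    return resample_poly(audio, target_rate // g, sample_rate // g)

# e.g. a 44.1 kHz file gets converted before calling the processor:
# audio_input = ensure_16k(audio_input, sample_rate)
# input_values = processor(audio_input, sampling_rate=16000, return_tensors="pt").input_values
```

Even with this in place for my own recordings, the LibriSpeech sample above (already 16 kHz) still decodes poorly for me.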

A large part of the output I get does not look like English words.

I cannot believe this is normal behavior for a model with the WER reported in its research paper.

Many other tools are benchmarked against wav2vec2, so I’m very surprised that this is actually state-of-the-art transcription that gets one of the lowest WERs.

As can be seen in the image of the transcription variable contents, most words are not even real words.

Is there something that I am missing? Does this transcription still need some post-processing?