I am trying to run inference with the pretrained wav2vec2-base-960h model on a couple of audio files containing a conversation between English speakers.
The decoded text looks like mostly gibberish.
I was wondering why, in case I am doing something wrong.
The code I'm using is the most basic inference example from the wav2vec2 GitHub page:
```python
# !pip install transformers
# !pip install datasets
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load pretrained model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

librispeech_samples_ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# load audio (index a single example; the column itself is a list)
audio_input, sample_rate = sf.read(librispeech_samples_ds[0]["file"])

# pad input values and return pt tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

# INFERENCE
# retrieve logits & take argmax
logits = model(input_values).logits  # this is the line where OOM appears
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe (decode expects a 1-D sequence, so take the first batch item)
transcription = processor.decode(predicted_ids[0])
```
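One thing I have since double-checked: the model card says wav2vec2-base-960h was trained on 16 kHz speech, so I assume audio files at any other rate would need resampling before being passed to the processor. A minimal sketch of that check (the helper name `resample_to_16k` is mine, using scipy):

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(audio, sample_rate):
    """Resample a mono waveform to the 16 kHz rate wav2vec2 expects."""
    target_rate = 16000
    if sample_rate == target_rate:
        return audio
    # resample_poly takes an integer up/down ratio; reduce it by the gcd
    g = np.gcd(sample_rate, target_rate)
    return resample_poly(audio, target_rate // g, sample_rate // g)

# e.g. a one-second 44.1 kHz clip becomes 16000 samples
clip = np.random.randn(44100).astype(np.float32)
resampled = resample_to_16k(clip, 44100)
```

I am not sure whether my files are already at 16 kHz; if they are not, this alone could explain the gibberish.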
For a large part of the output, the text does not look like English words.
I cannot believe this is normal behavior for a tool that achieved the WER reported in its research paper.
Many other tools are benchmarked against wav2vec2, so I'm very surprised this would be state-of-the-art transcription with one of the lowest WERs.
As can be seen in the screenshot of the transcription variable's contents, most words are not even real words.
Is there something that I am missing? Does this transcription still need some post-processing?