I am trying to run inference with the pretrained wav2vec2-base-960h model on a couple of audio files containing a conversation between English speakers.
The decoded text looks like mostly gibberish.
I was wondering why, in case I am doing something wrong.
The code I'm using is the most basic inference example from the wav2vec2 GitHub page:
```python
# !pip install transformers
# !pip install datasets
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load pretrained model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

librispeech_samples_ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# load audio (index a single example; the column itself is a list)
audio_input, sample_rate = sf.read(librispeech_samples_ds[0]["file"])

# pad input values and return pt tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

# INFERENCE
# retrieve logits & take argmax
logits = model(input_values).logits  # this is the line where OOM appears
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe (decode expects a 1-D sequence, so take the first batch item)
transcription = processor.decode(predicted_ids[0])
```
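One thing I have since double-checked: the model card says wav2vec2-base-960h was trained on 16 kHz speech, so I assume audio files at any other rate would need resampling before being passed to the processor. A minimal sketch of that check (the helper name `resample_to_16k` is mine, using scipy):

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(audio, sample_rate):
    """Resample a mono waveform to the 16 kHz rate wav2vec2 expects."""
    target_rate = 16000
    if sample_rate == target_rate:
        return audio
    # resample_poly takes an integer up/down ratio; reduce it by the gcd
    g = np.gcd(sample_rate, target_rate)
    return resample_poly(audio, target_rate // g, sample_rate // g)

# e.g. a one-second 44.1 kHz clip becomes 16000 samples
clip = np.random.randn(44100).astype(np.float32)
resampled = resample_to_16k(clip, 44100)
```

I am not sure whether my files are already at 16 kHz; if they are not, this alone could explain the gibberish.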
For a large part of the output, the text does not look like English words.
I cannot believe this is normal behavior for a tool that achieved the WER reported in its research paper.
Many other tools are benchmarked against wav2vec2, so I'm very surprised this would be state-of-the-art transcription with one of the lowest WERs.
As can be seen in the screenshot of the transcription variable's contents, most words are not even real words.
Is there something that I am missing? Does this transcription still need some post-processing?