Hi great community,
I was trying to test Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h") on my own WAV files (recorded with pyaudio), and the generated transcription is far from what I expected. However, when I tested it with the samples downloaded from the model page (facebook/wav2vec2-base-960h · Hugging Face), it works really well, regardless of whether the file is FLAC or WAV. I also tried uploading my own audio file to the demo on that page (facebook/wav2vec2-base-960h · Hugging Face) and it worked very well. I am wondering whether there are any pre-processing steps I missed that the HF server side performs before the audio is read and fed to the model?
Below is my code (demo.wav is my own audio file); I would appreciate any pointers:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import soundfile as sf
import torch
# load model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# load audio
audio_input, sample_rate = sf.read("demo.wav")

# transcribe
input_values = tokenizer(audio_input, sample_rate=sample_rate, return_tensors="pt").input_values
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]
print(transcription)
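My unconfirmed guess is that the mismatch is the sampling rate or number of channels of my pyaudio recording, since this checkpoint was trained on 16 kHz mono speech. Here is a minimal sketch of what I plan to try next, using librosa (my choice, not something the model card prescribes) to resample and downmix before feeding the model, in case that helps narrow it down:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import librosa
import torch

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# librosa.load resamples to the requested rate (16 kHz here) and downmixes to mono by default
audio_input, sample_rate = librosa.load("demo.wav", sr=16000)

input_values = tokenizer(audio_input, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]
print(transcription)

Does this look like the right direction, or is there some other server-side pre-processing I am missing?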