I’m running simple wav2vec2 code on a short, noise-free voice recording:
```python
import torch
import librosa
from torchaudio.utils import download_asset
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# The processor is only used to decode the output below,
# never to prepare the input audio.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

FILE_NAME = "tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"
SPEECH_FILE = download_asset(FILE_NAME)

speech, sr = librosa.load(SPEECH_FILE, sr=16000)  # the model expects 16 kHz audio
speech = torch.tensor(speech).reshape(1, -1)      # add a batch dimension

with torch.no_grad():
    logits = model(speech).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])
transcription
```
Result: `'I HAD THAT CURIOSITY BESIDE ME AT THIS MOMENT'`
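Out of curiosity, here is a small sketch I used to check whether the `processor` would actually change this input; it only prints statistics, and assumes the checkpoint’s feature extractor exposes the usual `do_normalize` flag:

```python
# Compare the raw waveform with the input_values the processor would produce.
# With do_normalize=True the processed input should be roughly zero-mean /
# unit-variance; otherwise it should match the raw waveform.
processed = processor(speech.squeeze(0).numpy(), sampling_rate=16000,
                      return_tensors="pt").input_values
print("raw:       mean=%.5f std=%.5f" % (speech.mean().item(), speech.std().item()))
print("processed: mean=%.5f std=%.5f" % (processed.mean().item(), processed.std().item()))
print("do_normalize =", processor.feature_extractor.do_normalize)
```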
- As you can see, I didn’t pass the input audio through the `processor`; I only used `processor.decode` on the output.
- The examples on the net always run the input through the `processor` first (a sketch of that pattern follows below).
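To be concrete, this is the pattern I mean, sketched with the same checkpoint and variables as above (I’m assuming the standard `Wav2Vec2Processor.__call__` / `input_values` interface here):

```python
# Typical tutorial pattern: the processor prepares input_values
# (normalization/padding as configured) before the forward pass.
inputs = processor(speech.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.decode(predicted_ids[0]))
```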
So:

- What is the benefit of using the `processor`?
- When do we need to use it?