I am using a wav2vec2 model with default parameters, so the input-to-logits ratio is 320: feeding it one second of audio at 16 kHz should yield 16000 / 320 = 50 logits.
Surprisingly, I only get 49, which means the last moments of the audio are not transcribed, if I'm not mistaken.
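For reference, the 49 can be reproduced by walking the no-padding conv length formula through the feature extractor. The kernel sizes and strides below are the Hugging Face `Wav2Vec2Config` defaults (an assumption; adjust if your checkpoint differs):

```python
# Kernel sizes and strides of the wav2vec2 feature extractor
# (Hugging Face Wav2Vec2Config defaults, assumed here).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)

def num_frames(samples: int) -> int:
    """Apply the no-padding conv output-length formula layer by layer."""
    length = samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1  # floor division, no padding
    return length

print(num_frames(16000))   # 1 s at 16 kHz -> 49 frames, not 50
print(num_frames(160000))  # 10 s in one pass -> 499 frames
```

So each frame effectively covers a 400-sample receptive field advanced by 320 samples, and the trailing samples that don't fill a full window are dropped.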
This is an issue when using Wav2Vec2 for streaming with short audio chunks:
If the full audio lasts 10 s and we transcribe it in chunks of 1 s:
We expect roughly 500 logits but will only get 49 * 10 = 490, meaning some letters in the middle of the full transcript may be missing.
I suspect this is caused by the convolutional feature-extractor layers having no padding.
Is there any way I can fix this? Something like adding padding to the conv layers, maybe, but I haven't found a config parameter to do so.
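For now, the only workaround I've sketched is on the chunking side rather than the model side: overlap consecutive chunks by the receptive field minus the stride (400 - 320 = 80 samples with the defaults above), so every frame window falls entirely inside some chunk. This is pure arithmetic, not an existing HF API, and the constants are assumptions from the default config:

```python
STRIDE = 320     # samples per output frame (product of conv strides, assumed default)
RECEPTIVE = 400  # samples seen by one frame (effective receptive field, assumed default)

def frames_in(n_samples: int) -> int:
    """Frames produced for a chunk of n_samples (single-conv equivalent formula)."""
    return (n_samples - RECEPTIVE) // STRIDE + 1

def chunk_ranges(total_samples: int, frames_per_chunk: int = 49):
    """Yield (start, end) sample ranges whose frames tile the full audio.

    Each chunk advances by a whole number of frames (hop) but is 80 samples
    longer than the hop, so no frame window straddles a chunk boundary.
    """
    hop = frames_per_chunk * STRIDE
    length = (frames_per_chunk - 1) * STRIDE + RECEPTIVE  # hop + 80-sample overlap
    start = 0
    while start + RECEPTIVE <= total_samples:
        yield start, min(start + length, total_samples)
        start += hop

total = sum(frames_in(end - start) for start, end in chunk_ranges(160000))
print(total)  # 499 -- same as a single pass over the full 10 s
```

With this overlap the chunked frame count matches the single-pass count, though it obviously doesn't add padding to the conv layers themselves.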