According to: Wav2vec 2.0: Learning the structure of speech from raw audio
Wav2vec 2.0 tackles this issue by learning basic units that are 25ms long to enable learning of high-level contextualized representations.
and
The model first processes the raw waveform of the speech audio with a multilayer convolutional neural network to get latent audio representations of 25ms each.
- Why did they use 25ms and not 20ms or 30ms?
- To be sure I understand correctly: if the input wav file to the wav2vec2 model is 3.4 seconds long, will the model (the conv layers) split it into 136 pieces (3.4 * 1000 / 25)? A minimal sketch of how I'd check this empirically is below.
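
One way to check the frame count empirically (assuming the Hugging Face `transformers` `Wav2Vec2Model` and the `facebook/wav2vec2-base-960h` checkpoint, which are not mentioned in the article) would be to feed a 3.4-second dummy waveform through the model and look at the sequence length of the output:

```python
import torch
from transformers import Wav2Vec2Model

# Assumed checkpoint; any standard wav2vec2 checkpoint should use the same conv feature encoder.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# 3.4 s of dummy audio at the 16 kHz sampling rate the model expects
waveform = torch.zeros(1, int(3.4 * 16_000))  # shape: (batch, samples)

with torch.no_grad():
    out = model(waveform)

# last_hidden_state has shape (batch, num_frames, hidden_size);
# num_frames is how many pieces the conv feature encoder produced.
print(out.last_hidden_state.shape)
```

The conv layers' strides and kernel sizes should also be readable from `model.config.conv_stride` and `model.config.conv_kernel`, which would let the same count be derived by hand.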