In wav2vec2, why are the basic learned units 25ms long?

According to the blog post "Wav2vec 2.0: Learning the structure of speech from raw audio":

> Wav2vec 2.0 tackles this issue by learning basic units that are 25ms long to enable learning of high-level contextualized representations.
>
> The model first processes the raw waveform of the speech audio with a multilayer convolutional neural network to get latent audio representations of 25ms each.

  1. Why did they use 25ms and not, say, 20ms or 30ms?
  2. To make sure I understand correctly: if the input wav file to the wav2vec2 model is 3.4 seconds long, will the conv layers split it into 136 pieces (3.4 × 1000 / 25)?
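For reference, here is the arithmetic behind my second question as a small sketch. It assumes non-overlapping 25ms frames, which is my assumption and may not match the actual strides of the convolutional feature extractor:

```python
# Sketch: count how many 25ms frames fit in a 3.4s clip,
# assuming the frames do not overlap (an assumption on my part;
# the actual conv strides in wav2vec2 may produce overlapping
# receptive fields and a different frame count).
clip_seconds = 3.4
frame_ms = 25

num_frames = int(clip_seconds * 1000 / frame_ms)
print(num_frames)  # 136
```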