When extracting features with wav2vec2, what timestamps do the embeddings correspond to? I know the hop size is 20 ms, but is the first embedding centered at 10 ms? Some experiments I have done indicate that for the “base” model, it may start at 5 ms:
- 31439 samples => 97 embeddings
- 31440 samples => 98 embeddings
These results suggest that the embeddings aren’t centered. 320 samples is 20 ms (at 16 kHz), and 31440/320 = 98.25, so the embeddings seem to be shifted by 5 ms.
However, 31680 samples => 98 embeddings, and 31680 samples = 99 hops of 20 ms, which would instead suggest that the timestamps start at 20 ms.
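A minimal sketch for checking these counts, assuming the "base" feature encoder is a stack of seven unpadded 1-D convolutions with kernel sizes (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2), and a 16 kHz input. These values are my assumption from the published configuration, not something confirmed in this thread, and the helper names are hypothetical. It computes the expected number of embeddings, the overall receptive field and hop, and the center time of each frame's receptive field.

```python
# ASSUMPTION: (kernel, stride) per conv layer of the "base" feature encoder,
# with no padding. Adjust if your checkpoint's config differs.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000  # ASSUMPTION: 16 kHz input audio


def num_frames(num_samples, layers=CONV_LAYERS):
    """Number of output frames for an unpadded conv stack."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0
        length = (length - kernel) // stride + 1
    return length


def receptive_field_and_hop(layers=CONV_LAYERS):
    """Receptive field and hop of one output frame, in input samples."""
    rf, hop = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * hop
        hop *= stride
    return rf, hop


def frame_center_seconds(index, layers=CONV_LAYERS, sample_rate=SAMPLE_RATE):
    """Center of the receptive field of frame `index`, in seconds."""
    rf, hop = receptive_field_and_hop(layers)
    first_sample = index * hop
    return (first_sample + (rf - 1) / 2) / sample_rate


if __name__ == "__main__":
    for n in (31439, 31440, 31680):
        print(n, "samples =>", num_frames(n), "embeddings")
    print("receptive field, hop:", receptive_field_and_hop())
    print("center of first frame (s):", frame_center_seconds(0))
```

If those assumptions hold, the receptive field is 400 samples (25 ms) and the hop is 320 samples, so the count is floor((N - 400) / 320) + 1. That reproduces all three observations above (97, 98, 98), and it would place the first embedding over samples 0 to 399, centered near 12.5 ms, rather than at 5 ms or 20 ms.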
It would be useful to document this, even if there is no code for it.