Wav2vec2 feature timestamps? (not words)

When extracting features with wav2vec2, what timestamps do the embeddings correspond to? I know the hop size is 20 ms, but is the first embedding centered at 10 ms? Some experiments I have done indicate that for the “base” model it may start at 5 ms:

  • 31439 samples => 97 embeddings
  • 31440 samples => 98 embeddings

These counts suggest that the embeddings aren’t centered. 320 samples is 20 ms (at 16 kHz), and 31440/320 = 98.25, so the embeddings seem to be shifted by 5 ms (a quarter of a hop).

However, 31680 samples => 98 embeddings, and 31680 samples is exactly 99 hops of 20 ms, which suggests the timestamps start at 20 ms.
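One way to check these counts without running the model is to push the input length through the convolutional feature encoder by hand. The sketch below assumes the usual "base" layout of seven unpadded 1-D convolutions with kernels (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2), and a 16 kHz sample rate. None of that is stated in this thread, and the function names are made up for illustration, so treat it as a rough check rather than a description of any particular implementation.

```python
# Assumed conv feature-encoder layout for wav2vec2 "base" (not given in this thread).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # implied by "320 samples is 20 ms"


def num_frames(num_samples):
    """Number of embeddings for an input, assuming unpadded convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        if length < kernel:
            return 0  # input too short to produce any frame
        length = (length - kernel) // stride + 1
    return length


def receptive_field_and_hop():
    """Samples seen by one embedding, and samples between consecutive embeddings."""
    field, hop = 1, 1
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        field += (kernel - 1) * hop
        hop *= stride
    return field, hop


def frame_span_ms(index):
    """(start, center, end) in ms of the samples feeding embedding `index`."""
    field, hop = receptive_field_and_hop()
    first = index * hop          # first sample in the window
    last = first + field - 1     # last sample in the window
    to_ms = 1000.0 / SAMPLE_RATE
    return first * to_ms, (first + last) / 2 * to_ms, (last + 1) * to_ms


if __name__ == "__main__":
    for n in (31439, 31440, 31680):
        print(n, num_frames(n))        # 97, 98, 98
    print(receptive_field_and_hop())   # (400, 320)
    print(frame_span_ms(0))            # (0.0, 12.46875, 25.0)
```

Under these assumptions each embedding sees a 400-sample (25 ms) window and windows advance by 320 samples, so the count is floor((N - 400) / 320) + 1. That reproduces all three observations above, and it would put the window for embedding i at samples 320·i to 320·i + 399, i.e. centered at roughly 12.5 ms + 20 ms · i.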

It would be useful to document this, even if there is no code for it.


Were you able to find a solution for this?