Wav2Vec2Phoneme phoneme label durations seem off

I’m using a wav2vec2 model (and have tried several different versions) to transcribe phonemes and output their onsets and offsets. The model produces decent phoneme labels, but the durations of the phonemes seem off. According to the model, most phonemes last 20 ms (a few last 40 ms), so most words consist of 20 ms of one label, 0-100 ms of nothing, 20 ms of another label, 0-100 ms of nothing, and so on. Is there a reason why the model outputs such short phoneme durations, and is there a way to ‘fix’ it?
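
To make the pattern concrete, here is a minimal sketch of how per-frame labels turn into onsets and offsets. It is not the exact code from any library: the 20 ms frame stride, the `<pad>` blank symbol, the `frames_to_segments` helper, and the example frame sequence are all assumptions chosen for illustration.

```python
from itertools import groupby

FRAME_SEC = 0.02  # assumed frame stride (20 ms), matching the durations described above
BLANK = "<pad>"   # assumed blank symbol; the real token depends on the tokenizer


def frames_to_segments(frame_labels, frame_sec=FRAME_SEC, blank=BLANK):
    """Collapse runs of identical frame labels into (label, onset, offset) tuples.

    Blank frames are skipped, so they show up as gaps between segments.
    """
    segments = []
    index = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != blank:
            onset = index * frame_sec
            offset = (index + length) * frame_sec
            segments.append((label, round(onset, 3), round(offset, 3)))
        index += length
    return segments


# Hypothetical per-frame argmax output for a short word (not real model output)
frames = ["<pad>", "k", "<pad>", "<pad>", "æ", "æ",
          "<pad>", "<pad>", "<pad>", "t", "<pad>"]
segments = frames_to_segments(frames)
for label, onset, offset in segments:
    print(f"{label}: {onset:.2f}-{offset:.2f} s")
```

In this sketch a label's duration is simply the number of consecutive frames on which it is the top prediction, so a label that wins on a single frame comes out as 20 ms, and every blank frame in between becomes a stretch of "nothing".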