Wav2Vec2ForCTC abandons one logit sometimes


I am using a wav2vec2 model with default parameters, so the inputs-to-logits ratio is 320.
If I feed my model one second of audio at 16kHz, I expect to get 16000/320 = 50 logits.
Surprisingly, I only get 49. This means that the last moments of the audio have not been transcribed, if I’m not mistaken.

This is an issue when using Wav2Vec2 for streaming with short audio chunks:

  • If the full audio lasts 10 secs and we transcribe chunks of 1 sec, we expect 500 logits but will only get 49*10 = 490 logits, meaning some letters in the middle of the full transcript may be missing.

I suspect this is an issue with the convolutional layers not having padding.
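The suspicion can be checked with plain arithmetic. The sketch below assumes the default Wav2Vec2 feature encoder (seven unpadded 1-D convolutions with the kernel sizes and strides listed in the code, which is what `conv_kernel` and `conv_stride` hold in the default config). The helper name `num_frames` is made up for this example. Other checkpoints may use different values, so it is worth checking `model.config` first.

```python
# Default Wav2Vec2 feature-encoder settings (assumed here; compare with
# model.config.conv_kernel and model.config.conv_stride on your checkpoint).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of output frames of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Unpadded convolution: floor((length - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


one_second = num_frames(16_000)
ten_seconds = num_frames(160_000)
print(one_second)       # 49
print(ten_seconds)      # 499
print(10 * one_second)  # 490
```

The strides multiply to 320, but because no layer pads its input, every layer loses a few samples at the edge, and the total ends up one frame short of 16000/320.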

Is there any way I can fix this? Something like adding padding for the conv layers maybe, but I haven’t found a config parameter to do so.
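One possible workaround, given here only as a sketch, is to pad the waveform itself rather than the conv layers. It assumes the default feature encoder, for which the seven layers together have a stride of 320 samples and a receptive field of 400 samples. The helper names `num_frames` and `samples_to_pad` are made up for this example. Whether zero padding at chunk edges hurts transcription quality has not been tested here.

```python
import math

STRIDE = 320           # product of the default conv strides (assumed)
RECEPTIVE_FIELD = 400  # samples seen by one output frame (assumed default config)


def num_frames(num_samples: int) -> int:
    """Frames produced by the unpadded feature encoder."""
    if num_samples < RECEPTIVE_FIELD:
        return 0
    return (num_samples - RECEPTIVE_FIELD) // STRIDE + 1


def samples_to_pad(num_samples: int) -> int:
    """Smallest number of extra samples so that the encoder produces
    ceil(num_samples / STRIDE) frames."""
    target = math.ceil(num_samples / STRIDE)
    needed = (target - 1) * STRIDE + RECEPTIVE_FIELD
    return max(0, needed - num_samples)


chunk = 16_000
pad = samples_to_pad(chunk)
print(pad)                      # 80
print(num_frames(chunk + pad))  # 50
```

The extra samples could be appended as zeros (for example with `torch.nn.functional.pad`) before calling the processor, so that a 1 sec chunk yields 50 logits.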




Have you found anything else regarding this? I noticed something similar yesterday, but wasn’t really sure if my calculations were wrong or if it really was an issue haha.

I’m trying to obtain the effective output lengths of my model (the number of logits, excluding the ones added by input padding). The following basic code reproduces what I have tried so far:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# The model must live on the same device as the inputs
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h").to(device)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")

dummy_audio = torch.rand((10 * 16000,)).tolist() # dummy data simulating a 10s audio

inputs = processor(
    dummy_audio, sampling_rate=16000, return_tensors="pt"
)

input_values = inputs.input_values.to(device)
attention_mask = inputs.attention_mask.to(device)

with torch.no_grad():
    logits = model(input_values, attention_mask=attention_mask).logits.to("cpu")

# Approach 1, divide effective input length by given ratio
output_length_1 = attention_mask.sum(dim=-1) / model.config.inputs_to_logits_ratio

# Approach 2, use inner method of model
output_length_2 = model._get_feat_extract_output_lengths(attention_mask.sum(-1)).to(torch.long)

print(f"Expected number of effective logits with approach 1: {output_length_1}")
print(f"Expected number of effective logits with approach 2: {output_length_2}")
print(f"Actual number of effective logits in output: {logits.size()}")

The code returns:

Expected number of effective logits with approach 1: tensor([500.], device='cuda:0')
Expected number of effective logits with approach 2: tensor([499], device='cuda:0')
Actual number of effective logits in output: torch.Size([1, 499, 32])

I think the first approach (which follows the same reasoning as yours) is the more correct one. However, I too noticed that the number of returned logits was off by 1. As such, I am currently using the second approach, but I am interested in knowing more about this issue and whether it can be solved.