I am using a wav2vec2 model with default parameters, so the input-to-logits ratio is 320: feeding it one second of audio at 16 kHz should yield 16000 / 320 = 50 logits.
Surprisingly, I only get 49, which means the last moments of the audio are not transcribed, if I'm not mistaken.
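For reference, the 49 can be reproduced by walking the no-padding conv length formula through the feature extractor. The kernel sizes and strides below are the Hugging Face `Wav2Vec2Config` defaults (an assumption; adjust if your checkpoint differs):

```python
# Kernel sizes and strides of the wav2vec2 feature extractor
# (Hugging Face Wav2Vec2Config defaults, assumed here).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)

def num_frames(samples: int) -> int:
    """Apply the no-padding conv output-length formula layer by layer."""
    length = samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1  # floor division, no padding
    return length

print(num_frames(16000))   # 1 s at 16 kHz -> 49 frames, not 50
print(num_frames(160000))  # 10 s in one pass -> 499 frames
```

So each frame effectively covers a 400-sample receptive field advanced by 320 samples, and the trailing samples that don't fill a full window are dropped.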
This is an issue when using Wav2Vec2 for streaming with short audio chunks:
If the full audio lasts 10 s and we transcribe it in chunks of 1 s:
We expect roughly 500 logits but will only get 49 * 10 = 490, meaning some letters in the middle of the full transcript may be missing.
I suspect this is caused by the convolutional feature-extractor layers having no padding.
Is there any way I can fix this? Something like adding padding to the conv layers, maybe, but I haven't found a config parameter to do so.
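For now, the only workaround I've sketched is on the chunking side rather than the model side: overlap consecutive chunks by the receptive field minus the stride (400 - 320 = 80 samples with the defaults above), so every frame window falls entirely inside some chunk. This is pure arithmetic, not an existing HF API, and the constants are assumptions from the default config:

```python
STRIDE = 320     # samples per output frame (product of conv strides, assumed default)
RECEPTIVE = 400  # samples seen by one frame (effective receptive field, assumed default)

def frames_in(n_samples: int) -> int:
    """Frames produced for a chunk of n_samples (single-conv equivalent formula)."""
    return (n_samples - RECEPTIVE) // STRIDE + 1

def chunk_ranges(total_samples: int, frames_per_chunk: int = 49):
    """Yield (start, end) sample ranges whose frames tile the full audio.

    Each chunk advances by a whole number of frames (hop) but is 80 samples
    longer than the hop, so no frame window straddles a chunk boundary.
    """
    hop = frames_per_chunk * STRIDE
    length = (frames_per_chunk - 1) * STRIDE + RECEPTIVE  # hop + 80-sample overlap
    start = 0
    while start + RECEPTIVE <= total_samples:
        yield start, min(start + length, total_samples)
        start += hop

total = sum(frames_in(end - start) for start, end in chunk_ranges(160000))
print(total)  # 499 -- same as a single pass over the full 10 s
```

With this overlap the chunked frame count matches the single-pass count, though it obviously doesn't add padding to the conv layers themselves.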