ValueError: `mask_length` has to be smaller than `sequence_length`, while finetuning Wav2vec2.0

As in the title, I’m getting a ValueError saying that the mask length is longer than the sequence length while fine-tuning wav2vec2 models. So far I’ve tried wav2vec2-base and wav2vec2-large-xlsr-53, and the same error occurred for both.

I am getting around this error by filtering out examples shorter than a certain length. The error seems to come from the model’s convolutional feature extractor producing sequences shorter than the mask length, which makes the forward pass fail during training.
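For reference, you can compute the minimum clip length analytically instead of filtering by trial and error. The sketch below assumes the default wav2vec2 feature-extractor geometry (conv kernels `(10, 3, 3, 3, 3, 2, 2)` with strides `(5, 2, 2, 2, 2, 2, 2)`) and the default `mask_length` of 10; check your own config before relying on these numbers.

```python
def conv_out_length(input_length,
                    kernels=(10, 3, 3, 3, 3, 2, 2),
                    strides=(5, 2, 2, 2, 2, 2, 2)):
    """Number of frames the wav2vec2 conv feature extractor emits
    for a raw-audio input of `input_length` samples (default config)."""
    for k, s in zip(kernels, strides):
        input_length = (input_length - k) // s + 1
    return input_length

# _compute_mask_indices requires sequence_length > mask_length (10),
# so we need the conv output to be at least 11 frames.
min_samples = next(n for n in range(10_000) if conv_out_length(n) > 10)
print(min_samples)  # 3600 samples, i.e. 0.225 s at 16 kHz
```

So with the default config, any clip shorter than roughly a quarter of a second at 16 kHz will trigger this error, which is a useful sanity check on the filtering threshold.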

Is this the desired behavior for wav2vec2? It seems reasonable in some sense, since (English) words can’t be arbitrarily short. However, because of this I’m dropping more than 50% of my data.

Has anyone had the same issue? If so, how did you get around this error? Below is the full traceback:

    Traceback (most recent call last):
    File "/home/admin/projects/voice-assessment/wav2vec/kr/run.py", line 165, in <module>
        trainer.train()
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/transformers/trainer.py", line 1269, in train
        tr_loss += self.training_step(model, inputs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/transformers/trainer.py", line 1760, in training_step
        loss = self.compute_loss(model, inputs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/transformers/trainer.py", line 1794, in compute_loss
        outputs = model(**inputs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
        output.reraise()
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/_utils.py", line 425, in reraise
        raise self.exc_type(msg)
    ValueError: Caught ValueError in replica 0 on device 0.
    Original Traceback (most recent call last):
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
        output = module(*input, **kwargs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1467, in forward
        outputs = self.wav2vec2(
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1067, in forward
        hidden_states = self._mask_hidden_states(hidden_states)
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 982, in _mask_hidden_states
        mask_time_indices = _compute_mask_indices(
    File "/home/admin/.miniconda3/envs/voice/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 146, in _compute_mask_indices
        raise ValueError(
    ValueError: `mask_length` has to be smaller than `sequence_length`, but got `mask_length`: 10 and `sequence_length`: 9`

cc @patrickvonplaten

Maybe your audio is too short. You can try reducing the `mask_time_length` parameter in config.json, or setting `mask_time_prob` to 0.
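As a config sketch, both overrides can be passed directly to `from_pretrained`, which forwards unknown keyword arguments to the model config (the checkpoint name here is just an example):

```python
from transformers import Wav2Vec2ForCTC

# Disable time masking entirely (SpecAugment on the time axis off):
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    mask_time_prob=0.0,
)

# ...or keep masking but shorten the mask span so short clips still fit:
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    mask_time_length=4,
)
```

Note that turning off time masking removes a regularizer, so it may cost some accuracy on larger datasets; shortening `mask_time_length` is the gentler option.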

Yes, it turns out I was slicing my speech with timestamps in milliseconds instead of in sample indices. That made my speech signals far too short. Everything works fine after fixing that!
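For anyone hitting the same bug: a waveform array is indexed by sample, not by millisecond, so at 16 kHz a millisecond timestamp slices a clip 16x shorter than intended. A minimal conversion helper (assuming 16 kHz audio, as wav2vec2 expects):

```python
SAMPLE_RATE = 16_000  # wav2vec2 models expect 16 kHz input

def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a millisecond timestamp to a sample index."""
    return int(ms * sample_rate / 1000)

# Slicing a 1.5 s segment (500 ms to 2000 ms) out of a waveform:
start, end = ms_to_samples(500), ms_to_samples(2000)
# segment = waveform[start:end]  # 24_000 samples, not 1_500
print(start, end)  # 8000 32000
```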