Wav2vec2 config -- why is mask_time_prob=0.05 and not 0.5?

Hi everyone!

I have a question regarding pre-training of the Wav2vec 2.0 model. In the original paper it says that “[…] we sample p=0.065 of all time-steps to be starting indices and mask the subsequent M = 10 time-steps. This results in approximately 49% of all time steps to be masked […]” (section 4.2).

However, in the HF wav2vec2-base config.json file, the variable mask_time_prob is set to 0.05. This variable is passed to _compute_mask_indices in the Wav2Vec2Model method _mask_hidden_states, and is documented as:

The percentage of the whole axis (between 0 and 1) which will be masked. The number of independently generated mask spans of length mask_length is computed by mask_prob*shape[1]/mask_length. Note that due to overlaps, mask_prob is an upper bound and the actual percentage will be smaller.

I’m confused as to why mask_time_prob is set to 0.05 and not ~0.5, given that the authors state that approximately 49% of all time steps are masked.
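Part of the confusion may be that p means different things in the two places: in the paper, p = 0.065 is the probability that each time-step is the *start* of a span of M = 10 masked steps, and overlapping spans compound to roughly 49% coverage, while the HF docstring above describes mask_time_prob as (an upper bound on) the total masked fraction directly. A quick simulation of the paper's semantics (my own sketch, not the library code) illustrates the compounding:

```python
import numpy as np

# Rough simulation of the paper's span masking: each time-step is a span
# start with probability p, and the next mask_length steps are masked.
# Coverage therefore far exceeds p itself.
def masked_fraction(p, mask_length, seq_len=1000, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        mask = np.zeros(seq_len, dtype=bool)
        for start in np.flatnonzero(rng.random(seq_len) < p):
            mask[start:start + mask_length] = True
        total += mask.mean()
    return total / trials

# Paper setting: p=0.065, M=10 -> roughly half the sequence masked (~0.49)
print(masked_fraction(0.065, 10))
```

Under the HF docstring's semantics, by contrast, mask_time_prob=0.05 caps coverage at about 5%, which matches the ~0.043 you observed.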

The reason I’m asking is that I’m pretraining a wav2vec2-base model on the Switchboard dataset using this pretraining script, but I’m having a hard time improving the pre-trained model – the loss fluctuates but doesn’t show a decreasing trend. I’m seeing that only ~0.043 of my data is being masked, and was wondering if the model would learn better if this number was increased, and if so, whether it should really be closer to 50%.

Thankful for any thoughts and replies
Magnus


3 years later…
Yeah, you were absolutely correct. It’s still the default, but you should increase that value if you want the masking to do anything. Maybe it was a typo, but there are other questionable defaults in there as well, so I don’t know. I use SpecAugment-style masking after the first epoch. For best results, though, I use old-fashioned waveform augmentation (for Whisper).
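For anyone landing here later: if you do want heavier masking, the value can be overridden when building the config. A minimal sketch (0.5 is an illustrative value to tune, not a recommendation; check the parameter semantics for your transformers version):

```python
from transformers import Wav2Vec2Config

# Illustrative override: raise mask_time_prob from the 0.05 default.
# Per the docstring quoted above, it upper-bounds the masked fraction,
# so a value near 0.5 is closer to the paper's ~49% coverage.
config = Wav2Vec2Config(mask_time_prob=0.5, mask_time_length=10)
print(config.mask_time_prob)
```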

example:

from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift

augment_waveform = Compose([
    AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=0.2),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.2, leave_length_unchanged=False),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.2),
])

def augment_dataset(batch):
    audio = batch["audio"]["array"]
    augmented_audio = augment_waveform(samples=audio, sample_rate=16000)
    batch["audio"]["array"] = augmented_audio
    return batch

dataset = dataset.map(augment_dataset)

Or you could just add it into the data-processing mapping…

def prepare_dataset(batch):
    audio = batch["audio"]
    augmented_audio = augment_waveform(samples=audio["array"], sample_rate=16000)
    # Feed the augmented waveform (not the original array) to the feature extractor
    batch["input_features"] = processor.feature_extractor(
        augmented_audio, sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

either way…