Wav2vec2 config -- why is mask_time_prob=0.05 and not 0.5?

Hi everyone!

I have a question regarding pre-training of the Wav2vec 2.0 model. In the original paper it says that “[…] we sample p=0.065 of all time-steps to be starting indices and mask the subsequent M = 10 time-steps. This results in approximately 49% of all time steps to be masked […]” (section 4.2).

However, in the HF wav2vec2-base config.json file, the variable mask_time_prob is set to 0.05. This variable is passed to _compute_mask_indices in the Wav2Vec2Model method _mask_hidden_states, and is documented as:

The percentage of the whole axis (between 0 and 1) which will be masked. The number of independently generated mask spans of length mask_length is computed by mask_prob*shape[1]/mask_length. Note that due to overlaps, mask_prob is an upper bound and the actual percentage will be smaller.

I’m confused as to why mask_time_prob is set to 0.05 and not ~0.5, given that the authors state that approximately 49% of all time steps are masked.
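Part of the confusion may be that p means different things in the two places: in the paper, p = 0.065 is the probability that each time-step is the *start* of a span of M = 10 masked steps, and overlapping spans compound to roughly 49% coverage, while the HF docstring above describes mask_time_prob as (an upper bound on) the total masked fraction directly. A quick simulation of the paper's semantics (my own sketch, not the library code) illustrates the compounding:

```python
import numpy as np

# Rough simulation of the paper's span masking: each time-step is a span
# start with probability p, and the next mask_length steps are masked.
# Coverage therefore far exceeds p itself.
def masked_fraction(p, mask_length, seq_len=1000, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        mask = np.zeros(seq_len, dtype=bool)
        for start in np.flatnonzero(rng.random(seq_len) < p):
            mask[start:start + mask_length] = True
        total += mask.mean()
    return total / trials

# Paper setting: p=0.065, M=10 -> roughly half the sequence masked (~0.49)
print(masked_fraction(0.065, 10))
```

Under the HF docstring's semantics, by contrast, mask_time_prob=0.05 caps coverage at about 5%, which matches the ~0.043 you observed.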

The reason I’m asking is that I’m pretraining a wav2vec2-base model on the Switchboard dataset using this pretraining script, but I’m having a hard time improving the pre-trained model – the loss fluctuates but doesn’t show a decreasing trend. I’m seeing that only ~0.043 of my data is being masked, and was wondering if the model would learn better if this number was increased, and if so, whether it should really be closer to 50%.

Thankful for any thoughts and replies
Magnus


3 years later…
Yeah, you were absolutely correct. It’s still the default, but you should increase that value if you want the masking to do anything. Maybe it was a typo, but there are other questionable defaults in there as well, so I don’t know. I use SpecAugment-style masking after the first epoch. For best results, though, I use old-fashioned waveform augmentation (for Whisper).
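For anyone landing here later: if you do want heavier masking, the value can be overridden when building the config. A minimal sketch (0.5 is an illustrative value to tune, not a recommendation; check the parameter semantics for your transformers version):

```python
from transformers import Wav2Vec2Config

# Illustrative override: raise mask_time_prob from the 0.05 default.
# Per the docstring quoted above, it upper-bounds the masked fraction,
# so a value near 0.5 is closer to the paper's ~49% coverage.
config = Wav2Vec2Config(mask_time_prob=0.5, mask_time_length=10)
print(config.mask_time_prob)
```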

example:

from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift

augment_waveform = Compose([
    AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=0.2),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.2, leave_length_unchanged=False),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.2),
])

def augment_dataset(batch):
    audio = batch["audio"]["array"]
    augmented_audio = augment_waveform(samples=audio, sample_rate=16000)
    batch["audio"]["array"] = augmented_audio
    return batch

dataset = dataset.map(augment_dataset)

Or you could just add it into the data-processing mapping…

def prepare_dataset(batch):
    audio = batch["audio"]
    augmented_audio = augment_waveform(samples=audio["array"], sample_rate=16000)
    # Feed the augmented waveform (not the original array) to the feature extractor
    batch["input_features"] = processor.feature_extractor(
        augmented_audio, sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

either way…