I have a question regarding pre-training of the Wav2vec 2.0 model. In the original paper it says that “[…] we sample p=0.065 of all time-steps to be starting indices and mask the subsequent M = 10 time-steps. This results in approximately 49% of all time steps to be masked […]” (section 4.2).
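As a back-of-the-envelope check on the paper's figure: if each time step is independently chosen as a span start with probability p = 0.065 and every start masks the next M = 10 steps, then a step is unmasked only if none of the ~10 potential starts covering it fired, so the expected coverage is roughly 1 − (1 − p)^M (ignoring edge effects):

```python
# Rough expected masking coverage under the paper's sampling scheme.
p = 0.065   # probability that a time step is a span start
M = 10      # mask span length

coverage = 1 - (1 - p) ** M
print(round(coverage, 3))  # ≈ 0.49, matching the paper's "approximately 49%"
```

So the paper's numbers are internally consistent, which makes the 0.05 in the HF config all the more puzzling to me.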
However, in the HF wav2vec2-base config.json file, the variable mask_time_prob is set to 0.05. This value is passed to _compute_mask_indices in the Wav2Vec2Model method _mask_hidden_states (here), where it is documented as:
The percentage of the whole axis (between 0 and 1) which will be masked. The number of independently generated mask spans of length `mask_length` is computed by `mask_prob * shape / mask_length`. Note that due to overlaps, `mask_prob` is an upper bound and the actual percentage will be smaller.
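To convince myself of what this docstring implies, I sketched a small simulation of the scheme it describes: sample `mask_prob * seq_len / mask_length` start indices and mask `mask_length` consecutive steps from each. The function and variable names here are my own, not the HF implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_fraction(seq_len, mask_prob, mask_length, trials=500):
    """Average fraction of steps masked under the docstring's scheme
    (my own sketch, not the actual _compute_mask_indices code)."""
    num_spans = int(mask_prob * seq_len / mask_length)
    fractions = []
    for _ in range(trials):
        mask = np.zeros(seq_len, dtype=bool)
        # distinct start indices; spans can still overlap each other
        starts = rng.choice(seq_len - mask_length, size=num_spans, replace=False)
        for s in starts:
            mask[s:s + mask_length] = True
        fractions.append(mask.mean())
    return float(np.mean(fractions))

# mask_prob acts as an upper bound on coverage: overlaps push it slightly lower
print(masked_fraction(seq_len=1000, mask_prob=0.05, mask_length=10))
```

Under this reading, mask_time_prob = 0.05 caps the masked fraction at about 5% of the sequence, which is an order of magnitude below the paper's ~49% and is consistent with the ~0.043 I measure in my own runs (see below).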
I’m confused as to why mask_time_prob is set to 0.05 rather than something closer to 0.5, given that the authors state that approximately 49% of all time steps are masked.
The reason I’m asking is that I’m pretraining a wav2vec2-base model on the Switchboard dataset using this pretraining script, but I’m having a hard time improving the pre-trained model: the loss fluctuates but shows no decreasing trend. I measured that only ~0.043 of my data is being masked, and I’m wondering whether the model would learn better if this fraction were increased, and if so, whether it should really be closer to 50%.
Thankful for any thoughts and replies!