Wav2Vec2: Inner workings of the Trainer class

Hi all, I am following this guide in order to fine-tune the model for my dataset. The documentation of Wav2Vec2ForCTC says that the argument attention_mask should only be passed when the model’s processor has config.return_attention_mask == True. The processor created in the blog post does indeed have its return_attention_mask argument set to True.
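For reference, this is a minimal way to check that flag (the checkpoint below is only an illustrative example whose processor I believe returns an attention mask; substitute the processor built in the blog post):

```python
from transformers import Wav2Vec2Processor

# Illustrative checkpoint; substitute the processor created in the blog post.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# The flag lives on the feature extractor wrapped by the processor; when it is True,
# calling processor(...) returns an attention_mask alongside input_values.
print(processor.feature_extractor.return_attention_mask)  # True
```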

When setting up the Trainer, does it take care of this by itself? Does it figure out which arguments to pass based on these settings, or do I need to instruct it somehow? By the way, as far as I understand, the attention_mask is only needed when I feed a batch of data into the model, not for single data points.

Thanks in advance.

The Trainer takes the datasets after preprocessing has been applied, so setting this has nothing to do with the Trainer class.

Thank you for the answer. If you look at the blog post, the attention_mask feature is not included in the final dataset (the one produced after preprocessing). Is it therefore correct to assume that we never pass the attention_mask argument to the model after all, even though we should?

On the other hand, in the DataCollator class we use the value -100 in the arrays that contain the labels to indicate which positions correspond to padding tokens. Is this an equivalent mechanism?
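To make the question concrete, here is a toy sketch of that label masking (the values are made up; the masked_fill step mirrors what the blog's data collator does):

```python
import torch

# Toy padded label ids and the attention mask produced while padding them.
padded_label_ids = torch.tensor([[12,  7, 30,  0,  0],
                                 [ 5, 19,  2,  8, 11]])
labels_attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                                      [1, 1, 1, 1, 1]])

# Positions that are only padding get -100, as in the blog's data collator.
labels = padded_label_ids.masked_fill(labels_attention_mask.ne(1), -100)
print(labels)
# tensor([[  12,    7,   30, -100, -100],
#         [   5,   19,    2,    8,   11]])
```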

Thanks in advance.

Yes, the -100 indicates to the loss function that the corresponding tokens should be ignored in the loss computation.
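As a toy sketch of the idea (this mirrors how a CTC-style loss can drop those positions: only labels >= 0 ever reach the loss):

```python
import torch

labels = torch.tensor([[12,  7, 30, -100, -100],
                       [ 5, 19,  2,    8,   11]])

labels_mask = labels >= 0                              # True only for real tokens
target_lengths = labels_mask.sum(-1)                   # per-sample target lengths
flattened_targets = labels.masked_select(labels_mask)  # padding never reaches the loss

print(target_lengths)     # tensor([3, 5])
print(flattened_targets)  # tensor([12,  7, 30,  5, 19,  2,  8, 11])
```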

Good to know. I suspected they were equivalent options, but I was not certain. I noticed that during the test set evaluation the attention_mask is indeed passed explicitly, since there are no -100 values in that case (there is no dataloader/data collator involved when evaluating the test set).
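For completeness, this is roughly what that explicit call looks like at evaluation time (the checkpoint and the waveform are placeholders; substitute your fine-tuned model):

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Placeholder checkpoint; substitute your fine-tuned model directory.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model.eval()

speech = torch.zeros(16_000).numpy()  # one second of silence standing in for a real waveform
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    # attention_mask is passed explicitly here; no data collator is involved.
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))
```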

Thanks for the answers.

I have thought about this a bit more, and I have one more question if you can answer it: the attention mask determines which tokens the model attends to. If we do not pass the attention_mask to the model, does the -100 value take care of that as well?

As you said, it indicates to the loss function what tokens to ignore. But the decision to attend to specific tokens or not is something different, isn’t it?

After researching this in the docs, I found out the following things:

  1. The blog post I linked to uses a Data Collator class, which is responsible for preparing a batch before it is fed to the model. By replicating the inner workings of this class, I realized that the batch it returns does include the attention_mask feature, even though it is not present in the original dataset fed to the Data Collator (see the sketch after this list).
  2. After this, by looking at the docs for the Trainer class and at the docs covering the various Trainer utilities, one can see in the train() method of the former that the training loop iterates over all the batches provided by a PyTorch dataloader. Each batch is passed to the model (see the compute_loss() method in the first link above) by the following line of code: model(**inputs).
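To illustrate point 1: the attention_mask reappears because the padding call inside the collator creates it. A minimal sketch, assuming a processor whose feature extractor has return_attention_mask=True (the dummy input_values are placeholders):

```python
from transformers import Wav2Vec2Processor

# Illustrative processor with return_attention_mask=True; input_values are dummies.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

features = [{"input_values": [0.0] * 8_000},
            {"input_values": [0.0] * 16_000}]

# Essentially what the data collator does with the audio side of the batch.
batch = processor.pad(features, padding=True, return_tensors="pt")
print(batch.keys())  # dict_keys(['input_values', 'attention_mask'])
```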

Note that the inputs variable is one batch provided by the dataloader: a dictionary that includes the attention_mask key, as mentioned in my first point. This key is never removed, which leads me to assume that the line model(**inputs) implicitly passes the attention_mask to the model, as required by the official docs of Wav2Vec2ForCTC.
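A plain-Python toy example of that unpacking (the forward function below is just a stand-in for Wav2Vec2ForCTC.forward, not the real signature):

```python
def forward(input_values, attention_mask=None, labels=None):
    # Stand-in for the model's forward: report which arguments actually arrived.
    return {"got_attention_mask": attention_mask is not None,
            "got_labels": labels is not None}

# A collated batch is just a dict, so ** forwards every key as a keyword argument.
inputs = {"input_values": [0.1, 0.2], "attention_mask": [1, 1], "labels": [5, 7]}
print(forward(**inputs))  # {'got_attention_mask': True, 'got_labels': True}
```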

This is what I’ve found, and I’d really appreciate it if someone much more experienced than me could confirm these findings.