Training a model with custom attention masks in each layer

Pramodith · December 6, 2023, 6:12pm

Hey team,

I’m trying to train a model (BERT for now, though I’d like to use other encoder/decoder models later) with a custom attention mask. The mask varies with both the depth of the layer and the token’s position. A simple scheme would be to use normal dense self-attention for the first k layers and sparse attention for all later layers.
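To make the scheme concrete, here is a minimal sketch of the per-layer masks I have in mind. The function name `build_layer_masks`, the `window` parameter, and the choice of a local sliding window as the sparse pattern are all assumptions made for illustration. Any other sparse pattern could be substituted.

```python
def build_layer_masks(num_layers, seq_len, k, window):
    """Return one seq_len x seq_len 0/1 mask per layer (1 = may attend).

    Hypothetical helper: layers 0..k-1 get a dense mask, later layers get a
    sliding-window mask. The sliding window is only one possible sparse pattern.
    """
    masks = []
    for layer in range(num_layers):
        if layer < k:
            # Dense self-attention: every token attends to every token.
            mask = [[1] * seq_len for _ in range(seq_len)]
        else:
            # Sparse (assumed local) attention: token i attends to token j
            # only when they are at most `window` positions apart.
            mask = [
                [1 if abs(i - j) <= window else 0 for j in range(seq_len)]
                for i in range(seq_len)
            ]
        masks.append(mask)
    return masks


# Example: 4 layers, 5 tokens, dense for the first 2 layers, window of 1 afterwards.
masks = build_layer_masks(num_layers=4, seq_len=5, k=2, window=1)
```

Each nested list could then be turned into a tensor (for example with `torch.tensor(mask)`) and combined with the usual padding mask before being handed to the matching layer.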

I was wondering whether there is a simple way to accomplish this with a pre-trained model. Since this changes the logic of the model’s forward pass, do I need to subclass the corresponding model class (e.g. BertForSequenceClassification) and override the relevant functions, or is there an easier way?
