Adjusting parameters for the FC layers at the end

I am trying to fine-tune a pre-trained BigBird on a custom task. The pre-training dataset has about 25k samples for a model of 76M parameters, while the target dataset has about 800 samples.

During fine-tuning, I am unable to get the loss to converge; it is highly volatile (it looks like a noisy sine wave). It seems that my model might be underfitting on the dataset due to its high sequence length and/or complexity.

For the BERT model at the core, the size or complexity of the ‘Linear’ block can be adjusted to accommodate different tasks. However, the configuration exposed by Transformers only offers these parameters:

- vocab_size (int, optional, defaults to 50358) – Vocabulary size of the BigBird model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BigBirdModel.

- hidden_size (int, optional, defaults to 768) – Dimension of the encoder layers and the pooler layer.

- num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

- num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

- intermediate_size (int, optional, defaults to 3072) – Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

- hidden_act (str or function, optional, defaults to "gelu_new") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

- hidden_dropout_prob (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

- attention_probs_dropout_prob (float, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.

None of these gives a way to adjust the final Linear block.
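One way to check which config fields reach the final layers is to build two tiny, randomly initialised models and compare the shapes of the classification head. This is only a sketch: it assumes the task is sequence classification (so `BigBirdForSequenceClassification` applies), it relies on the head's internal attribute names `dense` and `out_proj` as found in the Transformers source, and the small sizes are made up so that nothing needs to be downloaded.

```python
from transformers import BigBirdConfig, BigBirdForSequenceClassification


def head_shapes(intermediate_size):
    """Return the weight shapes of the two Linear layers in the head."""
    # Tiny illustrative config (assumed values, not the real checkpoint sizes).
    config = BigBirdConfig(
        vocab_size=100,
        hidden_size=32,
        num_hidden_layers=1,
        num_attention_heads=2,
        intermediate_size=intermediate_size,
        max_position_embeddings=64,
        attention_type="original_full",
        num_labels=3,
    )
    model = BigBirdForSequenceClassification(config)
    head = model.classifier
    return tuple(head.dense.weight.shape), tuple(head.out_proj.weight.shape)


# intermediate_size only changes the encoder's feed-forward layers;
# the head is sized by hidden_size and num_labels alone.
small = head_shapes(64)
large = head_shapes(256)
print(small, large)
```

If the two results match, the head width is tied to `hidden_size`, so widening it means replacing the module rather than changing the config.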

Does anyone have any idea how I might be able to customize BigBird - or should I instead extract the encoder output for use in my own network?

From the docs, it seems that obtaining the sequence embeddings from the pre-trained model is pretty easy using the last_hidden_state or hidden_states returned by the BigBird model.
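For reference, a minimal sketch of the embedding route might look like the following. To keep it runnable offline it uses a tiny randomly initialised `BigBirdModel`; in practice the encoder would presumably be loaded with `BigBirdModel.from_pretrained(...)` on the actual checkpoint. The head layout (one hidden layer of 128 units, 3 labels) and the use of the first token as the sequence summary are assumptions for illustration.

```python
import torch
from torch import nn
from transformers import BigBirdConfig, BigBirdModel

# Tiny illustrative config (assumed values) so no download is needed.
config = BigBirdConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    attention_type="original_full",
)
encoder = BigBirdModel(config)
encoder.eval()

# Fake batch: 4 sequences of 16 token ids each.
input_ids = torch.randint(3, config.vocab_size, (4, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = encoder(input_ids=input_ids, attention_mask=attention_mask)

# last_hidden_state has shape (batch, sequence_length, hidden_size).
embeddings = outputs.last_hidden_state
first_token = embeddings[:, 0, :]  # first-token vector per sequence

# Hypothetical custom head with a freely chosen width.
head = nn.Sequential(
    nn.Linear(config.hidden_size, 128),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(128, 3),
)
logits = head(first_token)
print(embeddings.shape, logits.shape)
```

Wrapping the encoder call in `torch.no_grad()` keeps BigBird frozen, which may be a reasonable starting point with only about 800 target samples.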

However, I would still prefer a way to modify the size of the Linear layer in place, as that might be easier than taking the embeddings and building another PyTorch model on top of them.
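An in-place change along these lines could be sketched by assigning a new module to the model's `classifier` attribute. This assumes a sequence classification setup with `BigBirdForSequenceClassification`, whose forward pass hands the full sequence output to `self.classifier`; `WiderHead` and its sizes are hypothetical names chosen for illustration, and the tiny random config stands in for a real `from_pretrained(...)` call.

```python
import torch
from torch import nn
from transformers import BigBirdConfig, BigBirdForSequenceClassification


class WiderHead(nn.Module):
    """Hypothetical replacement head with a configurable inner width."""

    def __init__(self, hidden_size, inner_size, num_labels, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, inner_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(inner_size, num_labels),
        )

    def forward(self, features, **kwargs):
        # The stock head receives (batch, seq_len, hidden) and uses the
        # first token; accept an already pooled 2D tensor as well.
        if features.dim() == 3:
            features = features[:, 0, :]
        return self.net(features)


# Tiny illustrative config (assumed values) so no download is needed.
config = BigBirdConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    attention_type="original_full",
    num_labels=3,
)
model = BigBirdForSequenceClassification(config)

# Swap the head in place; the encoder weights are left untouched.
model.classifier = WiderHead(config.hidden_size, 256, config.num_labels)

input_ids = torch.randint(3, config.vocab_size, (4, 16))
labels = torch.tensor([0, 1, 2, 1])
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # gradients reach the new head and the encoder
print(outputs.logits.shape)
```

The new head starts from random weights, so it will likely need its own training regardless of how the encoder was initialised.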