Adjusting parameters for the FC layers at the end

I am trying to fine-tune a pre-trained BigBird on a custom task; the pre-training dataset has about 25k samples for a model of 76M parameters, while the target dataset has about 800 samples.

During fine-tuning, I am unable to get the loss to converge; it is highly volatile (it looks like a noisy sine wave). It seems that my model might be underfitting the dataset due to its high sequence length and/or complexity.

For the BERT model at its core, the size or complexity of the ‘Linear’ block can be adjusted to accommodate different tasks. However, the configuration accessible through Transformers only exposes the following:

- vocab_size (int, optional, defaults to 50358) – Vocabulary size of the BigBird model. Defines the number of different tokens that can be represented by the input_ids passed when calling BigBirdModel.

- hidden_size (int, optional, defaults to 768) – Dimension of the encoder layers and the pooler layer.

- num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

- num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

- intermediate_size (int, optional, defaults to 3072) – Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

- hidden_act (str or function, optional, defaults to "gelu_new") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

- hidden_dropout_prob (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

- attention_probs_dropout_prob (float, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.

There is no way to adjust that through any of these options.
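For reference, here is a minimal sketch of how those options map onto `BigBirdConfig`; none of them touch the size of the task head that sits on top of the encoder (the values below are just the documented defaults, for illustration):

```python
from transformers import BigBirdConfig, BigBirdModel

# Every knob the config exposes is encoder-wide; there is no field for the
# task-specific Linear head.
config = BigBirdConfig(
    vocab_size=50358,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu_new",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)

# Note: building the model from a config gives randomly initialised weights,
# not the pre-trained checkpoint.
model = BigBirdModel(config)
```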

Does anyone have an idea how I might customize BigBird, or should I instead extract the model's output embeddings for use in my own network?

Gm!
From the docs, it seems that obtaining the sequence embeddings from the pre-trained model is pretty easy using the last_hidden_state or hidden_states returned by the BigBird model.
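Something along these lines should work; the checkpoint name below is just a placeholder for whatever pre-trained weights you are using:

```python
import torch
from transformers import AutoTokenizer, BigBirdModel

# Placeholder checkpoint; substitute your own pre-trained weights.
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base")

inputs = tokenizer("Some long document ...", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

sequence_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
all_layer_states = outputs.hidden_states         # tuple: embeddings + each encoder layer
```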

However, I would still prefer it if there were a way to modify the size of the Linear layer in place, as that might be easier than taking the embeddings and constructing another PyTorch model to interface with them.
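In case it helps, here is one way to do that in place. This is only a sketch, assuming a sequence-classification head; the `classifier.dense` / `classifier.out_proj` attribute names follow the current Transformers source, so check `print(model.classifier)` on your version first:

```python
import torch.nn as nn
from transformers import BigBirdForSequenceClassification

# Placeholder checkpoint and label count; substitute your own.
model = BigBirdForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=2
)

hidden = model.config.hidden_size  # 768 for the base model
wider = 2 * hidden                 # arbitrary width, purely for illustration

# Widen the head in place: the encoder keeps its pre-trained weights, and
# only these two freshly initialised Linear layers are trained from scratch.
model.classifier.dense = nn.Linear(hidden, wider)
model.classifier.out_proj = nn.Linear(wider, model.config.num_labels)
```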