I am trying to fine-tune a pre-trained BigBird
on a custom task; the pre-training dataset has about 25k samples for a model of 76M parameters, while the target dataset has only about 800 samples.
During fine-tuning I cannot get the loss to converge; it is highly volatile (it looks like a noisy sine wave). My suspicion is that the model is underfitting the dataset because of its long sequence length and/or complexity.
For the BERT-style model at the core,
the size or complexity of the "Linear" block could in principle be adjusted to suit the task; however, the configuration exposed by Transformers only covers the following:
- vocab_size (int, optional, defaults to 50358) – Vocabulary size of the BigBird model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BigBirdModel.
- hidden_size (int, optional, defaults to 768) – Dimension of the encoder layers and the pooler layer.
- num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.
- intermediate_size (int, optional, defaults to 3072) – Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
- hidden_act (str or function, optional, defaults to "gelu_new") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- hidden_dropout_prob (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- attention_probs_dropout_prob (float, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.
There is no option there to adjust that block; only the parameters above can be changed, roughly as in the sketch below.
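For reference, a minimal sketch of what those exposed knobs let you change when building the model from a config. The specific sizes, dropout values, and num_labels here are placeholder assumptions, not values from my setup; note that changing them away from the pretrained checkpoint's values means the pretrained weights can no longer be loaded, so this would train from scratch:

```python
from transformers import BigBirdConfig, BigBirdForSequenceClassification

# Only the documented config fields are adjustable; there is no separate
# knob for the "Linear" block beyond intermediate_size and hidden_size.
config = BigBirdConfig(
    vocab_size=50358,
    hidden_size=512,              # assumption: smaller than the default 768
    num_hidden_layers=6,          # assumption: half the default depth
    num_attention_heads=8,        # must divide hidden_size evenly
    intermediate_size=2048,       # size of the feed-forward ("intermediate") layer
    hidden_dropout_prob=0.2,      # heavier regularization for a small target dataset
    attention_probs_dropout_prob=0.2,
    num_labels=2,                 # assumption: a binary classification task
)

# Instantiating from a modified config gives randomly initialized weights.
model = BigBirdForSequenceClassification(config)
```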
Does anyone have any idea how I might customize BigBird further - or should I instead extract the encoder output and feed it into my own network?
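If it helps frame the question, this is roughly what I mean by the second option: a sketch that freezes the pretrained backbone and feeds its pooled output into a small custom head. The checkpoint name ("google/bigbird-roberta-base"), head sizes, and two-class output are placeholders standing in for my own pretrained model and task:

```python
import torch
from transformers import BigBirdModel, BigBirdTokenizer

# Placeholder checkpoint; in my case this would be my own pre-trained BigBird.
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
encoder = BigBirdModel.from_pretrained("google/bigbird-roberta-base")

# Keep the pretrained backbone fixed and train only the custom head.
for param in encoder.parameters():
    param.requires_grad = False

# Custom "Linear" block, sized however the task needs.
head = torch.nn.Sequential(
    torch.nn.Linear(encoder.config.hidden_size, 256),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.2),
    torch.nn.Linear(256, 2),      # assumption: two output classes
)

inputs = tokenizer("example input text", return_tensors="pt")
with torch.no_grad():
    # pooler_output is the pooled [CLS] representation;
    # use .last_hidden_state instead for per-token features.
    features = encoder(**inputs).pooler_output

logits = head(features)
```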