I implemented a custom BERT binary classification model class by adding a classifier layer on top of the BERT model (attached below). However, the accuracy and other metrics differ significantly from what I get when I train the official BertForSequenceClassification model, which makes me wonder whether I am missing something in my class.
A few doubts I have:
When the official model is loaded with from_pretrained, are the classifier's weights also initialized from the pretrained model, or are they randomly initialized? In my custom class they are randomly initialized.
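import torch.nn as nn
from transformers import AutoConfig, AutoModel

class CustomBertClassifier(nn.Module):  # class name assumed; not shown in the post
    def __init__(self, encoder='bert-base-uncased',
                 num_labels=2, hidden_dropout_prob=0.1):  # defaults assumed; the signature was cut off
        super().__init__()
        self.config = AutoConfig.from_pretrained(encoder)
        self.encoder = AutoModel.from_pretrained(encoder)  # takes the checkpoint name, not the config object
        self.dropout = nn.Dropout(hidden_dropout_prob)
        self.classifier = nn.Linear(self.config.hidden_size, num_labels)

    def forward(self, input_sent):
        outputs = self.encoder(input_ids=input_sent['input_ids'],
                               attention_mask=input_sent['attention_mask'])  # assumed; the call was cut off
        pooled_output = self.dropout(outputs.pooler_output)  # pooled [CLS] output
        # for both tasks
        logits = self.classifier(pooled_output)
        return logits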
The weights of the SequenceClassification head are initialized randomly.
See this page: https://huggingface.co/transformers/training.html
When we instantiate a model with from_pretrained(), the model configuration and pre-trained weights of the specified model are used to initialize the model. The library also includes a number of task-specific final layers or ‘heads’ whose weights are instantiated randomly when not present in the specified pre-trained model. For example, instantiating a model with BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2) will create a BERT model instance with encoder weights copied from the bert-base-uncased model and a randomly initialized sequence classification head on top of the encoder with an output size of 2.
This makes sense. In that case, why is my custom BERT classification model’s accuracy lower than that of the official BertForSequenceClassification?
It’s a good question, but I don’t know the answer, sorry.
(When I tried to add a custom head to a BERT model, I couldn’t get it to learn at all!).
How different is the accuracy? If it’s only a little, then it could just be random chance.
When you fine-tune, are you freezing the main BERT layers? I think by default fine-tuning will propagate back into the main layers, which might not be what you want. Not sure that would be any different with the official SequenceClassification head though.
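To make the freezing question concrete, here is a minimal sketch of training only the classifier head. TinyClassifier and its small linear "encoder" are hypothetical stand-ins so the snippet runs without downloading BERT; the same requires_grad pattern should apply to a real encoder module.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: a small 'encoder' plus a classification head."""
    def __init__(self, hidden_size=8, num_labels=2):
        super().__init__()
        # Stand-in for the BERT encoder (assumption, not the real model)
        self.encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        return self.classifier(self.encoder(x))

model = TinyClassifier()

# Freeze the encoder so gradients only reach the head
for param in model.encoder.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Hand only the trainable parameters to the optimizer
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

inputs = torch.randn(4, 8)
labels = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
```

If the encoder is left unfrozen (the default), every parameter receives gradients, so checking the list of trainable parameter names in both the custom and the official model is a quick way to confirm they are being fine-tuned the same way.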
Have you looked at the code that is used for the official SequenceClassification head? This post, "Which loss function in bertforsequenceclassification regression", includes a link to the GitHub page for the code.
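As a rough point of comparison, the sketch below imitates the pattern I understand the official head to follow: dropout on the pooled output, a linear layer, and cross-entropy loss for classification. SequenceClassificationHead is a hypothetical name, and the explicit normal(0, 0.02) weight initialization with zero bias is my reading of BERT's initializer_range default; a plain nn.Linear is initialized differently, which may be one thing worth checking.

```python
import torch
import torch.nn as nn

class SequenceClassificationHead(nn.Module):
    """Hypothetical stand-alone head: dropout -> linear on the pooled output."""
    def __init__(self, hidden_size=768, num_labels=2,
                 dropout_prob=0.1, initializer_range=0.02):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)
        # BERT-style init (assumed): normal(0, initializer_range) weights, zero bias
        nn.init.normal_(self.classifier.weight, mean=0.0, std=initializer_range)
        nn.init.zeros_(self.classifier.bias)

    def forward(self, pooled_output, labels=None):
        logits = self.classifier(self.dropout(pooled_output))
        if labels is None:
            return logits
        # Single-label classification loss over num_labels classes
        loss = nn.functional.cross_entropy(logits, labels)
        return loss, logits

torch.manual_seed(0)
head = SequenceClassificationHead(hidden_size=16, num_labels=2)
pooled = torch.randn(4, 16)          # stand-in for the encoder's pooled output
labels = torch.tensor([0, 1, 1, 0])
loss, logits = head(pooled, labels)
```

Comparing a custom class against this pattern line by line (which encoder output is fed to the head, the dropout probability, the initialization, and the loss) may help narrow down where the two models diverge.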