Using EXTREMELY small dataset to finetune BERT

Hi, I have a domain-specific language classification problem that I am attempting to solve with a BERT model.

My approach has been to take the standard pretrained BERT model and run further unsupervised training on domain-specific corpora (using TSDAE training from the Sentence-Transformers framework).

I am now trying to take this domain-trained model and finetune it for a classification task. The problem is that I only have an extremely small labelled dataset (~1000 samples). I have run a few training experiments and, surprisingly, have received very good results that I am very sceptical of.

The task is to take natural language text and classify it into one of 5 classes. Here is my training setup:

    import torch.nn as nn
    import torch.optim as optim
    from transformers import BertModel

    class BertClassification(nn.Module):
        def __init__(self):
            super().__init__()
            # BERT encoder from the TSDAE domain-adaptation step
            self.bert = BertModel.from_pretrained("TSDAE_model/0_Transformer")
            self.to_class = nn.Linear(768, 5)  # 768-dim hidden state -> 5 classes

        def forward(self, x):
            # take the [CLS] token embedding from the last hidden state
            x = self.bert(x)[0][:, 0, :]
            x = self.to_class(x)
            return x

    model = BertClassification()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
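
For context, I call the model roughly like this (a rough sketch; the tokenizer checkpoint is just an example, and note that this forward() only consumes input_ids, so any attention mask from padded batches is ignored):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # example checkpoint
    batch = tokenizer(["some domain-specific text"], padding=True, truncation=True, return_tensors="pt")
    logits = model(batch["input_ids"])  # attention_mask is not passed through forward()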

And here are the training results:
[image: training accuracy/loss curves]

I am very unsure of how trustworthy these results are, as the dataset is so small. I have also tried freezing the BERT weights and training just the self.to_class linear layer (~4000 params), but that model peaks at only about 50% accuracy.
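
For reference, the freezing experiment looked roughly like this:

    # freeze every BERT weight so only the classification head trains
    for param in model.bert.parameters():
        param.requires_grad = False

    # the optimizer then only sees the linear head (768*5 weights + 5 biases, ~4k params)
    optimizer = optim.SGD(model.to_class.parameters(), lr=0.0001, momentum=0.9)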

I was hoping someone may be able to help me decide whether this is an appropriate training strategy for this dataset, or whether I should look at alternatives. Thanks!

hey @JoshuaP, are your 5 classes equally balanced? if not, you might be better off charting a metric like the F1-score, which tends to be less biased when most of your examples sit in just a few classes.
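
something like this with scikit-learn (y_true / y_pred standing in for your labels and predictions):

    from sklearn.metrics import f1_score

    # macro-F1 weights every class equally, so a majority-class shortcut can't inflate it
    f1_score(y_true, y_pred, average="macro")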

another idea would be to implement a baseline (e.g. the classic naive bayes :smiley:) and see how that compares against your transformer model.
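
a rough sketch (train_texts and friends standing in for your ~1000 labelled samples):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # tf-idf bag-of-words + multinomial naive bayes as a cheap baseline
    baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
    baseline.fit(train_texts, train_labels)
    print(baseline.score(test_texts, test_labels))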

finally you could try cross-validation (with a stratified split if your classes aren’t balanced) to mitigate some of the problems that come from doing a train/test split with small datasets
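
e.g. along these lines, with texts / labels as placeholders again:

    from sklearn.model_selection import StratifiedKFold

    # each fold preserves the class proportions of the full dataset
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(texts, labels):
        ...  # train a fresh model on the train split, evaluate on the test split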

Hi @lewtun, my classes aren’t balanced but I’m using a sampler to get an even 20% per class in each batch.
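
Roughly along these lines (a sketch; dataset and labels are placeholder names):

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # weight each sample by the inverse frequency of its class, so each batch
    # averages out to ~20% per class
    labels_t = torch.tensor(labels)
    class_counts = torch.bincount(labels_t)
    weights = 1.0 / class_counts[labels_t].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)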

Thanks for the suggestions, I will try and implement a simple baseline and use cross-validation!

I have no suggestion but rather a question :sweat_smile:
Would you mind showing me your whole code? I am currently trying to write a token classifier / NER model, also with very little data, but I’m not quite sure how to define the optimizer with only the linear layers while freezing all the BERT weights.

~1000 samples is tiny? Honestly, I’ve had good experiences training classification models with 1k samples, but nothing below that. Those accuracy/loss curves look a lot like our results as well.