Using EXTREMELY small dataset to finetune BERT

Hi, I have a domain-specific language classification problem that I am attempting to solve with a BERT model.

My approach has been to take the standard pretrained BERT model and run further unsupervised training on domain-specific corpora (TSDAE training from the Sentence-Transformers framework).

I am now trying to take this domain-trained model and finetune it for a classification task. The problem is that I only have an extremely small labelled dataset (~1000 samples). I have been running a few training experiments and, surprisingly, have been getting very good results that I am quite sceptical of.

The task is to take natural language text and classify it into 1 of 5 classes. Here is my training setup:

    import torch.nn as nn
    import torch.optim as optim
    from transformers import BertModel

    class BertClassification(nn.Module):
        def __init__(self):
            super().__init__()
            # domain-adapted encoder produced by the TSDAE training step
            self.bert = BertModel.from_pretrained("TSDAE_model/0_Transformer")
            self.to_class = nn.Linear(768, 5)

        def forward(self, x):
            # take the [CLS] token embedding from the last hidden state
            x = self.bert(x)[0][:, 0, :]
            x = self.to_class(x)
            return x

    model = BertClassification()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
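For completeness, a minimal sketch of how the model gets called, assuming the tokenizer files were saved alongside the TSDAE checkpoint:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("TSDAE_model/0_Transformer")

    # the forward pass above expects input_ids only (no attention mask)
    batch = tokenizer(["some domain-specific text"], return_tensors="pt",
                      padding=True, truncation=True)
    logits = model(batch["input_ids"])  # shape: (batch_size, 5)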

And here are the training results:
[attached screenshot: training accuracy/loss curves]

I am very unsure of how trustworthy these results are, as the dataset is so small. I have also tried freezing the BERT weights and just training the self.to_class linear layer (~4000 params), but then the model peaks at only about 50% accuracy.
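For reference, a minimal sketch of that frozen setup, reusing `model` and `optim` from above:

    # freeze the BERT encoder so gradients only flow to the linear head
    for param in model.bert.parameters():
        param.requires_grad = False

    # give the optimizer only the ~4k trainable head parameters
    optimizer = optim.SGD(model.to_class.parameters(), lr=0.0001, momentum=0.9)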

I was hoping someone might be able to help me decide whether this is an appropriate training strategy for this dataset, or if I should look at alternatives. Thanks!

hey @JoshuaP, are your 5 classes equally balanced? if not, you might be better off charting a metric like the f1-score, which tends to be less biased when most of your examples sit in just a few classes.
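e.g. a macro-averaged f1-score with scikit-learn (hypothetical `y_true`/`y_pred` below):

    from sklearn.metrics import f1_score

    # hypothetical labels/predictions for a 5-class problem
    y_true = [0, 0, 1, 2, 3, 4, 4]
    y_pred = [0, 0, 1, 2, 3, 3, 4]

    # "macro" averages the per-class f1 scores equally, so minority
    # classes count as much as the majority class
    print(f1_score(y_true, y_pred, average="macro"))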

another idea would be to implement a baseline (e.g. the classic naive bayes :smiley:) and see how that compares against your transformer model.
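a tf-idf + naive bayes baseline is only a few lines in scikit-learn (a sketch; `texts` and `labels` are hypothetical stand-ins for your data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # texts: list of raw strings, labels: list of class ids (hypothetical)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42)

    baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
    baseline.fit(X_train, y_train)
    print(baseline.score(X_test, y_test))  # held-out accuracy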

finally you could try cross-validation (with a stratified split if your classes aren't balanced) to mitigate some of the problems that come from a single train/test split on a small dataset.
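a stratified 5-fold loop could look roughly like this (again with the hypothetical `texts`/`labels`):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = np.array(texts)    # hypothetical raw strings (see baseline above)
    labels = np.array(labels)  # hypothetical class ids

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(texts, labels):
        # fit a fresh model on every fold; the transformer could be swapped
        # in here in place of the naive bayes baseline
        clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
        clf.fit(texts[train_idx], labels[train_idx])
        scores.append(clf.score(texts[val_idx], labels[val_idx]))
    print(np.mean(scores), np.std(scores))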


Hi @lewtun, my classes aren't balanced, but I'm using a sampler to get an even 20% per class in each batch.
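Roughly like this, with a WeightedRandomSampler (a sketch; `labels` and `dataset` are placeholders for my actual tensors):

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # labels: LongTensor of class ids, dataset: the training Dataset (hypothetical)
    class_counts = torch.bincount(labels)
    # weight each sample by the inverse frequency of its class, so each
    # of the 5 classes is drawn ~20% of the time
    sample_weights = 1.0 / class_counts[labels].float()
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                    replacement=True)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)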

Thanks for the suggestions, I will try and implement a simple baseline and use cross-validation!

I have no suggestion, just a question :sweat_smile:
Would you mind showing me your whole code? I am currently trying to write a token classifier / NER model, also with very little data, but I'm not quite sure how to define the optimizer over only the linear layers while freezing all the BERT weights.

Is ~1000 samples really tiny? Honestly, I've had good experiences training classification models with 1k samples, but nothing below that. That accuracy/loss curve looks a lot like our results as well.

Hi,
I have a similar problem and it seems you can help me!
My dataset has 500 samples (250 per class). Surprisingly, I trained a LightGBM classifier on it and it scored 1.0 on every binary classification metric (precision, recall, and F1 score) on the training set, though its scores on the validation set were not good enough.
On the other hand, I trained a RoBERTa classifier and after 50 epochs it only reached a score of 0.5. In other words, my transformer couldn't learn as much as a LightGBM model!
I think there is a subtle mistake that I can't figure out.
Could you please give me suggestions?

Hello,
I want to use k-fold cross-validation, but I'm not sure how to do that. Can you help me or point me in the right direction?
Thank you.