Using EXTREMELY small dataset to finetune BERT

Hi, I have a domain-specific language classification problem that I am attempting to solve with a BERT model.

My approach has been to take the standard pretrained BERT model and run further unsupervised training on domain-specific corpora (using TSDAE training from the Sentence-Transformers framework).

I am now trying to take this domain-trained model and finetune it for a classification task. The problem is that I only have an extremely small labelled dataset (~1000 samples). I have run a few training experiments and, surprisingly, have received very good results that I am very sceptical of.

The task is to take natural language text and classify it into one of 5 classes. Here is my training setup:

    import torch.nn as nn
    import torch.optim as optim
    from transformers import BertModel

    class BertClassification(nn.Module):
        def __init__(self):
            super().__init__()
            # BERT encoder from the TSDAE domain-adaptation step
            self.bert = BertModel.from_pretrained("TSDAE_model/0_Transformer")
            self.to_class = nn.Linear(768, 5)  # 768-dim hidden state -> 5 classes

        def forward(self, x):
            # take the [CLS] token embedding from the last hidden state
            x = self.bert(x)[0][:, 0, :]
            x = self.to_class(x)
            return x

    model = BertClassification()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
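
For context, I call the model roughly like this (a rough sketch; the tokenizer checkpoint is just an example, and note that this forward() only consumes input_ids, so any attention mask from padded batches is ignored):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # example checkpoint
    batch = tokenizer(["some domain-specific text"], padding=True, truncation=True, return_tensors="pt")
    logits = model(batch["input_ids"])  # attention_mask is not passed through forward()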

And here are the training results:
[image: training accuracy/loss curves]

I am very unsure of how trustworthy these results are, as the dataset is so small. I have also tried freezing the BERT weights and training just the self.to_class linear layer (~4000 params), but that model peaks at only about 50% accuracy.
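
For reference, the freezing experiment looked roughly like this:

    # freeze every BERT weight so only the classification head trains
    for param in model.bert.parameters():
        param.requires_grad = False

    # the optimizer then only sees the linear head (768*5 weights + 5 biases, ~4k params)
    optimizer = optim.SGD(model.to_class.parameters(), lr=0.0001, momentum=0.9)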

I was hoping someone may be able to help me decide whether this is an appropriate training strategy for this dataset, or whether I should look at alternatives. Thanks!

hey @JoshuaP, are your 5 classes equally balanced? if not, you might be better off charting a metric like the F1-score, which tends to be less biased when most of your examples sit in just a few classes.
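
something like this with scikit-learn (y_true / y_pred standing in for your labels and predictions):

    from sklearn.metrics import f1_score

    # macro-F1 weights every class equally, so a majority-class shortcut can't inflate it
    f1_score(y_true, y_pred, average="macro")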

another idea would be to implement a baseline (e.g. the classic naive bayes :smiley:) and see how that compares against your transformer model.
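
a rough sketch (train_texts and friends standing in for your ~1000 labelled samples):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # tf-idf bag-of-words + multinomial naive bayes as a cheap baseline
    baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
    baseline.fit(train_texts, train_labels)
    print(baseline.score(test_texts, test_labels))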

finally you could try cross-validation (with a stratified split if your classes aren’t balanced) to mitigate some of the problems that come from doing a train/test split with small datasets
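
e.g. along these lines, with texts / labels as placeholders again:

    from sklearn.model_selection import StratifiedKFold

    # each fold preserves the class proportions of the full dataset
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(texts, labels):
        ...  # train a fresh model on the train split, evaluate on the test split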

Hi @lewtun, my classes aren’t balanced but I’m using a sampler to get an even 20% per class in each batch.
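
Roughly along these lines (a sketch; dataset and labels are placeholder names):

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # weight each sample by the inverse frequency of its class, so each batch
    # averages out to ~20% per class
    labels_t = torch.tensor(labels)
    class_counts = torch.bincount(labels_t)
    weights = 1.0 / class_counts[labels_t].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)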

Thanks for the suggestions, I will try and implement a simple baseline and use cross-validation!

I have no suggestion but rather a question :sweat_smile:
Would you mind showing me your whole code? I am currently trying to write a token classifier / NER model, also with very little data, but I’m not quite sure how to define the optimizer with only the linear layers while freezing all the BERT weights.

~1000 samples is tiny? Honestly, I’ve had good experiences training classification models with 1k samples, but nothing below that. Those accuracy/loss curves look a lot like our results as well.