I am now trying to take this domain-trained model and finetune it for a classification task. The problem is that I only have an extremely small labelled dataset (~1000 samples). I have been running a few training experiments and, surprisingly, have received very good results that I am quite sceptical of.
The task is to take natural language text and classify it into 1 of 5 classes. Here is my training setup:
import torch.nn as nn
from transformers import BertModel

class BertClassification(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("TSDAE_model/0_Transformer")
        self.to_class = nn.Linear(768, 5)  # hidden size 768 -> 5 classes

    def forward(self, input_ids, attention_mask=None):
        # take the [CLS] token embedding from the last hidden state
        # (passing attention_mask lets BERT ignore padding tokens)
        x = self.bert(input_ids, attention_mask=attention_mask)[0][:, 0, :]
        return self.to_class(x)
And here are the training results:
I am very unsure of how trustworthy these results are, as the dataset is so small. I have also tried freezing the BERT weights and training only the self.to_class linear layer (~4000 params), but then the model peaks at only about 50% accuracy.
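For reference, that frozen-encoder setup can be sketched as follows. A stand-in encoder (the hypothetical TinyClassifier below) replaces the pretrained BertModel so the snippet runs on its own; the pattern is the same either way:

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Stand-in for BertClassification: any 768-dim encoder plus a linear head."""
    def __init__(self):
        super().__init__()
        # placeholder; in the real model this is BertModel.from_pretrained(...)
        self.bert = nn.Linear(768, 768)
        self.to_class = nn.Linear(768, 5)

model = TinyClassifier()

# freeze every encoder parameter so gradients only flow to the head
for p in model.bert.parameters():
    p.requires_grad = False

# build the optimizer over trainable parameters only
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# 768 * 5 weights + 5 biases = 3845, i.e. the "~4000 params" above
n_trainable = sum(p.numel() for p in trainable)
```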
I was hoping someone might be able to help me decide whether this is an appropriate training strategy for this dataset, or whether I should look at alternatives. Thanks!
hey @JoshuaP, are your 5 classes equally balanced? if not, you might be better off tracking a metric like the F1-score, which is less biased in cases where most of your examples sit in just a few classes.
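A toy illustration of why with scikit-learn (the numbers are made up): a majority-class predictor can look fine on accuracy while macro F1 exposes the classes it ignores:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]  # a majority-class predictor

# accuracy looks decent at 4/6 ~ 0.67 ...
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ... but macro F1 averages per-class F1, so the two ignored classes drag it down
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```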
another idea would be to implement a baseline (e.g. classic naive Bayes) and see how it compares against your transformer model.
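Such a baseline is only a few lines with scikit-learn; here's a minimal sketch on toy data (the texts and labels below are placeholders, not from the original dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy stand-ins for the real labelled samples
texts = ["great product", "terrible service", "okay experience", "great service"]
labels = [0, 1, 2, 0]

# tf-idf features fed into a multinomial naive Bayes classifier
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(texts, labels)
preds = baseline.predict(texts)
```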
finally, you could try cross-validation (with a stratified split if your classes aren’t balanced) to mitigate some of the problems that come from a single train/test split on a small dataset.
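A quick sketch of a stratified split with scikit-learn, on synthetic data standing in for the ~1000 samples: each test fold preserves the overall class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)           # stand-in features
y = np.array([i % 5 for i in range(100)])   # 5 perfectly balanced classes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for train_idx, test_idx in skf.split(X, y):
    # each test fold keeps the same class proportions as the full dataset
    fold_sizes.append(len(test_idx))
```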
I have no suggestions, just a question:
Would you mind sharing your whole code? I am currently writing a token classifier / NER model, also with very little data, but I’m not quite sure how to define the optimizer over only the linear layers while freezing all the BERT weights.
~1000 samples is tiny? Honestly, I’ve had good experiences training classification models with 1k samples - but nothing below that. That accuracy/loss curve looks a lot like our results as well.
Hi,
I have a similar problem, and it seems you can help me!
My dataset has 500 samples (250 per class). Surprisingly, a LightGBM classifier trained on it scored 1.0 on every binary classification metric (precision, recall, and F1 score), though its scores on the validation set were not nearly as good.
On the other hand, a RoBERTa classifier reached only 0.5 after 50 epochs! In other words, my transformer couldn’t learn as much as the LightGBM model!
I think there is a subtle mistake that I can’t figure out!
Could you please give me suggestions?