Text classification on small dataset (8K)


I’m trying to find a good architecture for a text classification model. The domain is a help-desk chatbot whose goal is to book appointments for customers who need machines repaired. The current model only has to classify single utterances into one of 20 categories.

I have around 8K examples in my dataset. I’m wondering whether there is a recommended transformer-based architecture for this kind of task.

I tried a model with a frozen DistilBERT layer followed by a fully connected layer, before a classification layer.

So basically the same architecture as DistilBertForSequenceClassification presented here:
but with the DistilBERT layer frozen.
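
For readers who want to reproduce this setup, here is a minimal sketch of what "DistilBertForSequenceClassification with the DistilBERT layer frozen" could look like with PyTorch and Hugging Face Transformers. It is not the poster's actual code. The tiny random config, the `freeze_encoder` helper and the learning rate are assumptions chosen so the snippet runs offline; in practice you would load pretrained weights with `from_pretrained`.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

NUM_LABELS = 20  # the 20 categories mentioned in the post


def freeze_encoder(model):
    """Freeze every DistilBERT encoder parameter; the head stays trainable.

    Hypothetical helper written for this sketch.
    """
    for param in model.distilbert.parameters():
        param.requires_grad = False
    return model


# Assumption: a tiny, randomly initialised config so this runs without a download.
# With real data you would use something like
# DistilBertForSequenceClassification.from_pretrained(
#     "distilbert-base-uncased", num_labels=NUM_LABELS)
config = DistilBertConfig(
    n_layers=2, n_heads=2, dim=64, hidden_dim=128, num_labels=NUM_LABELS
)
model = freeze_encoder(DistilBertForSequenceClassification(config))

# Only the fully connected layer (pre_classifier) and the classification
# layer (classifier) are left with requires_grad=True.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Hand only the trainable parameters to the optimizer (lr is an arbitrary choice).
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```

With the encoder frozen, the model can only learn a mapping on top of fixed DistilBERT features, which is one plausible reason for mediocre results on a 20-class problem.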

But the results were so-so. So I’m wondering about two things:

  1. Is there something more appropriate than DistilBERT for my set-up?
  2. Should I maybe keep only some layers of DistilBERT frozen and not all of them?

If anyone has a suggestion, I would be glad to hear about it.

Thank you in advance!

In my experience (I worked with BERT and RoBERTa), not updating the transformer model parameters during fine-tuning resulted in lower accuracy and a slower decrease in the loss value. This might mean that the fully connected layer alone is not enough to model the task at hand. I suggest updating the parameters of the DistilBERT model as well, which is what fine-tuning is for.

I should also note that freezing the first 6 layers of BERT-base did not decrease the accuracy of the model significantly, in my case.
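
As a rough illustration of partial freezing, the sketch below freezes the first 6 of 12 encoder layers of a BERT-style classifier and leaves the rest trainable. The helper name `freeze_first_layers`, the choice to also freeze the embeddings, and the tiny random config are assumptions made for this example, not details from my experiments; with real data you would load `bert-base-uncased` (or similar) via `from_pretrained`.

```python
from transformers import BertConfig, BertForSequenceClassification


def freeze_first_layers(model, n_frozen, freeze_embeddings=True):
    """Freeze the first n_frozen encoder layers of a BERT classifier.

    Hypothetical helper; freezing the embeddings as well is an assumption.
    """
    if freeze_embeddings:
        for param in model.bert.embeddings.parameters():
            param.requires_grad = False
    for layer in model.bert.encoder.layer[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
    return model


# Assumption: small random config with 12 layers (like BERT-base) so the
# sketch runs offline; swap in from_pretrained(...) for actual fine-tuning.
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=12,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=20,
)
model = freeze_first_layers(BertForSequenceClassification(config), n_frozen=6)

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable} of {n_total}")
```

For DistilBERT the same idea should apply to `model.distilbert.transformer.layer`, which holds its 6 transformer blocks, so freezing the first 3 would be the analogous split.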
