I have a really basic question on the whole BERT / fine-tuning BERT for classification topic:
I have a dataset of customer reviews spanning 7 different labels such as “Customer Service”, “Tariff”, “Provider related”, etc. The dataset contains 12,700 unlabelled customer reviews, and I have labelled 1,100 of them for my classification task. Now to my questions:
Would it be enough to take an existing BERT model and fine-tune it with AutoModelForSequenceClassification on my specific task?
Are 1,100 labelled reviews (around 150 per class) enough to train it?
What other approaches are there?
I am completely new to NLP and have been working on this for 3 weeks. I have just learned that more traditional approaches are often outperformed by transformers…
Thank you so much for your reply and for helping me move forward. It now looks like I have to use multi-label classification instead of multi-class… does it make any difference in the way I set up the transformer?
If a given review can have more than 1 label, then it’s a multi-label text classification problem indeed.
The only thing you’ll need to change is setting problem_type to "multi_label_classification" when instantiating an xxxForSequenceClassification model. Suppose we have 7 different labels and want to do multi-label classification; then you can, for example, instantiate a BERT model as follows:
from transformers import BertForSequenceClassification

# problem_type="multi_label_classification" makes the model treat each of the
# 7 labels as an independent binary decision (a sigmoid per label) instead of
# picking a single class (a softmax over all labels)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    problem_type="multi_label_classification",
    num_labels=7,
)
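One practical difference worth spelling out: with problem_type="multi_label_classification", the model computes BCEWithLogitsLoss internally when you pass labels, so your targets must be multi-hot float vectors (one slot per label) rather than single class indices, and at inference time you apply a sigmoid and threshold each label independently. Here is a minimal sketch of a forward pass and a prediction step (the model is re-instantiated so the snippet is self-contained; the review text, the label vector, and the 0.5 threshold are just illustrative assumptions):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    problem_type="multi_label_classification",
    num_labels=7,
)

# Made-up review that touches two of the 7 classes,
# say "Customer Service" and "Tariff"
text = "The support agent was friendly, but the new tariff is far too expensive."
encoding = tokenizer(text, return_tensors="pt")

# Multi-label targets are multi-hot *float* vectors with one slot per label,
# not a single class index as in multi-class classification
labels = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

# When labels are passed, the model computes BCEWithLogitsLoss internally
outputs = model(**encoding, labels=labels)
print(outputs.loss)

# At inference time: sigmoid each logit and threshold per label
probs = torch.sigmoid(outputs.logits)
predicted = (probs > 0.5).int()  # 0.5 is a common default cut-off

For fine-tuning with the Trainer, the only change on your side is that the "labels" column of your dataset holds these multi-hot float vectors; tokenization, the optimizer, and the rest of the training loop stay the same as in the multi-class case.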