For my PhD research, I’m considering (with colleagues) manually labeling data for sentiment and some broad medical classification. I’m curious to know your thoughts on the optimal quantity of training data needed for fine-tuning a BERT model. Thank you all and happy Friday!
Anyone out there?
In our paper on toxic language in Brazilian Portuguese, we labeled 21k examples (41% toxic, 59% non-toxic). Our BERT model achieved an F1-score of 75%. We also ran a learning-curve analysis, training on increasing numbers of examples to see how the model’s performance changed. Take a look at the paper: [2010.04543] Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis
Hope it helps
Obrigado! I appreciate you sending a link to your paper and sharing your insight. How long did it take to complete the labeling task?
We crowdsourced the annotation with 42 people divided into 14 groups. Each group annotated 1.5k examples, and every example was labeled by 3 annotators, so in total we collected 63k labels for 21k examples. The entire process, from recruiting the annotators to having the complete dataset, took about a month.
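For anyone planning a similar effort, here is the arithmetic behind those numbers as a quick sketch. It assumes every member of a group labeled all of that group's examples, which is an inference from the figures above rather than something stated outright. The variable names are just for illustration.

```python
# Figures quoted in the post above.
annotators = 42
groups = 14
examples_per_group = 1_500
annotators_per_example = 3

# Assumption: groups are equal-sized and each member labels the whole group batch.
annotators_per_group = annotators // groups

unique_examples = groups * examples_per_group
total_labels = unique_examples * annotators_per_example
labels_per_annotator = total_labels // annotators

print(f"{annotators_per_group} annotators per group")
print(f"{unique_examples:,} unique examples, {total_labels:,} labels in total")
print(f"{labels_per_annotator:,} labels per annotator")
```

So the per-person workload is the number to budget for: under these assumptions each annotator handles about 1.5k examples.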
I would say the answer really depends on the number of classes you use for classification. Generally, the more classes you have, the more samples you are going to need.
However, in practice (if you have fewer than ~20 classes), I would say labeling a few thousand samples should do the trick (somewhere between 1,000 and 10,000 samples).
Labeling by hand can be exhausting, so first try to label 1,000 samples, and you can continue building up your dataset as you fine-tune BERT on it. Also, a nice trick is to use data augmentation to enrich your dataset by masking random tokens in each sample at each epoch (of course, you want to make sure the masked token is not essential for the classification task, so you can use TF-IDF to build a probability distribution for masking each token in the sample).
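To make the masking trick more concrete, here is a minimal sketch in plain Python. It is one possible reading of the idea, not a reference implementation: the function names, the `base_rate` of 0.15, and the choice to make the masking probability inversely proportional to the TF-IDF score are all assumptions for illustration. It works on pre-tokenized samples (lists of strings).

```python
import math
import random
from collections import Counter

MASK = "[MASK]"  # BERT's mask token string; with a real tokenizer use its own mask token

def idf_table(docs):
    """Smoothed inverse document frequency for every token in the tokenized docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def mask_probs(doc, idf, base_rate=0.15):
    """Per-token masking probability: informative (high TF-IDF) tokens get a lower one."""
    if not doc:
        return []
    tf = Counter(doc)
    # Unknown tokens fall back to an IDF of 1.0 (an arbitrary choice for this sketch).
    scores = [tf[t] / len(doc) * idf.get(t, 1.0) for t in doc]
    inverse = [1.0 / s for s in scores]
    mean_inverse = sum(inverse) / len(inverse)
    # Rescale so the average probability is base_rate, capping each value at 1.
    return [min(1.0, base_rate * v / mean_inverse) for v in inverse]

def augment(doc, idf, base_rate=0.15, rng=random):
    """Return a copy of doc with tokens randomly replaced by MASK."""
    probs = mask_probs(doc, idf, base_rate)
    return [MASK if rng.random() < p else tok for tok, p in zip(doc, probs)]

docs = [
    ["the", "drug", "caused", "nausea"],
    ["the", "drug", "was", "effective"],
    ["the", "patient", "felt", "fine"],
]
idf = idf_table(docs)
probs = mask_probs(docs[0], idf)
masked = augment(docs[0], idf, rng=random.Random(0))
print(list(zip(docs[0], [round(p, 3) for p in probs])))
print(masked)
```

Calling `augment` again at every epoch gives the model a differently masked view of the same sample. Here "the" appears in every document, so it is the most likely token to be masked, while rarer words like "nausea" are mostly left alone.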
Last but not least, try not to train your BERT model for many epochs. Most open source references (including this one) suggest fine-tuning BERT for no more than 4 or 5 epochs and using a warmup scheduler during training.
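As a rough illustration of what a warmup scheduler does, here is a small sketch of a linear warmup followed by linear decay, written in plain Python so it runs anywhere. The peak learning rate of 2e-5, the batch size of 16, and the 10% warmup fraction are assumptions picked for the example, not recommendations from this thread.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate with linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return peak_lr * (remaining / max(1, total_steps - warmup_steps))

# Hypothetical setup: 1,000 labeled samples, batch size 16, 4 epochs.
num_samples, batch_size, epochs = 1_000, 16, 4
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.1 * total_steps)  # assumed 10% warmup

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps, warmup_steps))
```

If you train with the Hugging Face `transformers` library, `get_linear_schedule_with_warmup` provides the same shape of schedule for a PyTorch optimizer, so you would not normally write this by hand.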
That’s it for me. I hope I was able to help!
Thank you so much for your ideas! We have been labeling and YES it is exhausting haha. I apologize that it has taken me so long to respond. Cheers!