Thoughts on quantity of training data for fine tuning

Cheltone · February 19, 2022, 1:31am

Hello Everyone,
For my PhD research, I’m considering (with colleagues) manually labeling data for sentiment and some broad medical classification. I’m curious to know your thoughts on the optimal quantity of training data needed for fine-tuning a BERT model. Thank you all, and happy Friday!

Cheltone · February 21, 2022, 9:17pm

Anyone out there?

JAugusto97 · February 21, 2022, 9:27pm

Hey!

In our paper on Portuguese toxic language data, we labeled 21k examples (41% toxic, 59% non-toxic), and our BERT model achieved a 75% F1-score. We also ran a learning-curve analysis, increasing the number of training examples to see how the model’s performance changed. Take a look at the paper: [2010.04543] Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis
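If you want to run the same kind of check on your own labels, here is a minimal sketch of a learning-curve loop. It is not the code from the paper. `train_and_score` is a hypothetical callable you would supply yourself (for example, one that fine-tunes BERT on the subset and returns F1 on a fixed held-out set), and the sizes are placeholders.

```python
import random


def learning_curve(examples, train_and_score, sizes, seed=0):
    """Score a model trained on nested subsets of increasing size.

    `train_and_score` is a hypothetical user-supplied function: it takes a
    list of training examples and returns a single metric value.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)  # shuffle once so every larger subset contains the smaller ones
    results = []
    for n in sizes:
        n = min(n, len(shuffled))  # never ask for more examples than exist
        results.append((n, train_and_score(shuffled[:n])))
    return results


if __name__ == "__main__":
    # Stand-in scorer so the sketch runs without a model; replace it with real training.
    fake_data = list(range(5000))
    curve = learning_curve(fake_data, lambda subset: len(subset) / 5000, [500, 1000, 2000, 5000])
    for n, score in curve:
        print(n, score)
```

If the score is still climbing at your largest subset, more labels will probably help. If it has flattened out, extra labeling is likely to give diminishing returns.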

Hope it helps

Cheltone · February 21, 2022, 9:39pm

Obrigado! I appreciate you sending a link to your paper and for your insight. How long did it take to complete the labeling task?

JAugusto97 · February 21, 2022, 9:58pm

We crowdsourced the annotation with 42 people divided into 14 groups of 3. Each group annotated 1.5k examples, for 21k examples in total. Since every example had 3 annotators, that comes to 63k individual annotations. The entire process, from recruiting the annotators to having the complete dataset, took about a month.

eli4s · February 22, 2022, 8:57pm

Hi!
I would say the answer really depends on the number of classes you use for classification. Of course, the more classes you have, the more samples you are going to need.
However, in practice, if you have fewer than ~20 classes, I would say labeling a few thousand samples should do the trick (somewhere between 1,000 and 10,000 samples).
Labeling by hand can be exhausting, so first try to label 1,000 samples, then keep building up your dataset as you fine-tune BERT on it. Another nice trick is to use data augmentation to enrich your dataset by masking random tokens in each sample at each epoch. Of course, you want to make sure the masked token is not essential for the classification task, so you can use tf-idf to build a probability distribution for masking each token in the sample.
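To make the masking idea concrete, here is a rough sketch in plain Python. It is one possible reading of the trick, not a reference implementation. The smoothed idf formula, the inverse tf-idf weighting, the 15% `base_rate`, and all function names are my own assumptions, and it works on whitespace-style tokens rather than BERT word pieces.

```python
import math
import random
from collections import Counter

MASK_TOKEN = "[MASK]"  # BERT's mask token; other tokenizers may use a different string


def compute_idf(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def mask_probabilities(tokens, idf, base_rate=0.15):
    """Per-token masking probability: tokens with low tf-idf are masked more often."""
    if not tokens:
        return []
    tf = Counter(tokens)
    scores = [tf[t] / len(tokens) * idf.get(t, 1.0) for t in tokens]
    inverse = [1.0 / s for s in scores]  # high tf-idf -> small weight
    mean_inverse = sum(inverse) / len(inverse)
    # Rescale so the average probability is base_rate (before capping at 1.0).
    return [min(1.0, base_rate * w / mean_inverse) for w in inverse]


def augment(tokens, idf, rng, base_rate=0.15):
    """Return a copy of `tokens` with some positions replaced by MASK_TOKEN."""
    probs = mask_probabilities(tokens, idf, base_rate)
    return [MASK_TOKEN if rng.random() < p else t for t, p in zip(tokens, probs)]


if __name__ == "__main__":
    docs = [
        ["the", "patient", "has", "fever"],
        ["the", "patient", "is", "fine"],
        ["the", "scan", "shows", "tumor"],
    ]
    idf = compute_idf(docs)
    rng = random.Random(0)
    for epoch in range(3):  # a fresh random mask at every epoch
        print(epoch, augment(docs[0], idf, rng))
```

Calling `augment` again at each epoch gives the model a slightly different view of every sample, while rare (and probably informative) tokens such as "fever" are masked less often than common ones such as "the".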
Last but not least, try not to train your BERT model for many epochs. Most open-source references (including this one) suggest fine-tuning BERT for no more than 4 or 5 epochs and using a warmup scheduler during training.
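For a feel of what a warmup schedule does over such a short run, here is a small pure-Python sketch of linear warmup followed by linear decay. As far as I know this is the same shape that `get_linear_schedule_with_warmup` in 🤗 Transformers produces. The dataset size, batch size, peak learning rate, and 10% warmup ratio below are illustrative assumptions, not recommendations from this thread.

```python
import math


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Illustrative numbers only (assumptions, not values from the thread).
num_examples = 1000
batch_size = 16
epochs = 4
peak_lr = 2e-5

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.1 * total_steps)  # warm up over the first 10% of steps


def lr_at(step):
    """Learning rate used at a given optimizer step."""
    return peak_lr * linear_warmup_decay(step, warmup_steps, total_steps)


if __name__ == "__main__":
    for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
        print(step, lr_at(step))
```

With only a few epochs, the total number of steps is small, so the warmup keeps the first updates gentle instead of hitting the pretrained weights with the full learning rate right away.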
That’s it for me. I hope I was able to help!

Cheltone · March 10, 2022, 10:35pm

Thank you so much for your ideas! We have been labeling and YES it is exhausting haha. I apologize that it has taken me so long to respond. Cheers!
