Hi everyone!
I’m fine-tuning the google-bert/bert-base-multilingual-cased model for content moderation classification, using AutoModelForSequenceClassification with problem_type set to multi_label_classification.
In the initial training, I used 500k messages for training and 100k for validation. This model will go into production, and all blocked messages will be moderated by a team, allowing us to collect corrected data daily.
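For context, this is roughly how the problem_type setting wires up the loss. To keep the snippet self-contained I build a tiny randomly initialized BERT from a config; in the real pipeline I call from_pretrained on google-bert/bert-base-multilingual-cased instead, and the label count is a placeholder:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random-init config so this runs standalone; in production the
# model is loaded with from_pretrained("google-bert/bert-base-multilingual-cased").
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
    num_labels=2,  # placeholder, e.g. "safe" and "blocked"
    problem_type="multi_label_classification",  # -> BCEWithLogitsLoss
)
model = BertForSequenceClassification(config)

input_ids = torch.tensor([[1, 2, 3]])
labels = torch.tensor([[1.0, 0.0]])  # multi-hot FLOAT labels are required
out = model(input_ids=input_ids, labels=labels)
print(out.logits.shape)  # torch.Size([1, 2])
```

With this problem_type, Transformers applies a sigmoid per label rather than a softmax across labels, so each label is scored independently.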
I have some questions about best practices for continuous training:
- Continuous training with new data:
- I did an initial test using the already fine-tuned model as a starting point and trained with new data (approximately 2k messages, with 1800 “safe” and 200 “blocked”). However, the results were unsatisfactory, as the model became biased towards predicting “safe.”
- Is it better to continue training from the already fine-tuned model using only the new data? If so, what are the best practices for this procedure?
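One mitigation I’m considering for the imbalance above (a sketch of an idea, not something I’ve validated): up-weight the minority "blocked" label in the loss via the pos_weight argument of BCEWithLogitsLoss, using the negative/positive ratio from the new batch of data:

```python
import torch
from torch import nn

# Counter the 1800 "safe" vs 200 "blocked" imbalance by up-weighting
# the positive ("blocked") label: pos_weight = #negatives / #positives.
num_safe, num_blocked = 1800, 200
pos_weight = torch.tensor([num_safe / num_blocked])  # 9.0

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.tensor([[0.2], [-1.5]])  # raw model outputs (toy values)
labels = torch.tensor([[1.0], [0.0]])   # 1.0 = "blocked"
loss = loss_fn(logits, labels)
print(float(loss))
```

In a Trainer setup this would mean overriding compute_loss so the weighted loss replaces the default one; I’d also lower the learning rate when resuming from the fine-tuned checkpoint.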
- Training from scratch with new and old data:
- Would it be more effective to add the new data to the initial dataset and perform the fine-tuning from scratch using the google-bert/bert-base-multilingual-cased model again?
- Minimum data quantity:
- Is there any rule of thumb about the minimum amount of data needed to perform a new training and achieve significant results?
Thanks in advance for any help or guidance!