Continuous training of google-bert/bert-base-multilingual-cased

Hi everyone!

I’m fine-tuning the google-bert/bert-base-multilingual-cased model for content-moderation classification, using AutoModelForSequenceClassification with problem_type set to multi_label_classification.

In the initial training, I used 500k messages for training and 100k for validation. This model will go into production, and all blocked messages will be moderated by a team, allowing us to collect corrected data daily.
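
For reference, here is a simplified version of how the classifier is set up; the label count shown is illustrative, not my real label set.

```python
# Simplified setup; the number of labels is a placeholder, not the real label set.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "google-bert/bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    problem_type="multi_label_classification",  # switches the classification loss to BCEWithLogitsLoss
    num_labels=2,  # placeholder, e.g. "safe" / "blocked"
)
```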

I have some questions about best practices for continuous training:

  1. Continuous training with new data:

    • I did an initial test that continued training from the already fine-tuned model, using only the new data (approximately 2k messages: 1,800 “safe” and 200 “blocked”). The results were unsatisfactory: the model became biased towards predicting “safe”.
    • Is it better to continue training from the already fine-tuned checkpoint using only the new data? If so, what are the best practices for this procedure? (See the sketch after this list for what I’m considering trying next.)
  2. Training from scratch with new and old data:

    • Would it be more effective to add the new data to the initial dataset and redo the fine-tuning from scratch, starting again from google-bert/bert-base-multilingual-cased?
  3. Minimum data quantity:

    • Is there a rule of thumb for the minimum amount of new data needed for a retraining round to produce meaningful improvements?
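
To make questions 1 and 2 a bit more concrete, here is roughly the next attempt I have in mind: continue from the fine-tuned checkpoint, but “replay” a random sample of the original training data alongside the new corrections, so the small, imbalanced batch doesn’t pull the model towards “safe”. All paths, column names, the replay ratio and the hyperparameters below are placeholders, not my actual values.

```python
# Rough sketch of replay-style continued fine-tuning.
# Paths, the replay ratio and hyperparameters are placeholders.
from datasets import load_from_disk, concatenate_datasets
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "checkpoints/initial-finetune",            # the already fine-tuned model
    problem_type="multi_label_classification",
    num_labels=2,
)

old_train = load_from_disk("data/initial_train")       # original 500k messages (already tokenized)
new_train = load_from_disk("data/daily_corrections")   # ~2k newly moderated messages (already tokenized)
eval_set = load_from_disk("data/initial_validation")   # keep the same validation set to compare runs

# Replay e.g. 10x as many old examples as new ones (the ratio is something to tune)
replay = old_train.shuffle(seed=42).select(range(10 * len(new_train)))
mixed = concatenate_datasets([new_train, replay]).shuffle(seed=42)

training_args = TrainingArguments(
    output_dir="checkpoints/continual",
    num_train_epochs=2,
    learning_rate=1e-5,                # lower than the initial fine-tune to limit forgetting
    per_device_train_batch_size=32,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mixed,
    eval_dataset=eval_set,
)
trainer.train()
trainer.evaluate()
```

The idea behind the replay sample is to keep the class balance close to the original distribution while still showing the model every new correction each epoch. Does that sound like a reasonable middle ground between options 1 and 2?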

I appreciate any help or guidance in advance!

Thank you!


I have the same problem, so I would appreciate any updates on this topic.
