BERT model with duplicated data and F1 score

Hello guys, good evening.

I have a question about data duplication when training a BERT model. I have a dataset of 100k comments with 50 unique labels. Some comments are duplicated up to 4 times, i.e. the same comment appears 4 times with the same label. Should I train my model using only unique comments? I ask because when I train on unique comments my weighted F1 score is 79, but when I use the data with duplicates the F1 score jumps to 93. So should I remove the duplicates?

My guess is that the model is learning more from the duplicates, since it sees those comments and their associated labels more often.

Another question… my comments are around 300 words each and relate to customer complaints. Does it make sense for me to use bert-base-uncased, or would moving to bert-large help in any way?

Your feedback / suggestions are appreciated.



Having duplicates in the training data should be fine, but they should not leak into the test set.

Keeping everything unique would be ideal. The jump from 79 to 93 looks like a data leak: if you split after shuffling, copies of the same comment can land in both train and test, so the model is evaluated on examples it has already memorized.
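A minimal sketch of the safe workflow: drop exact duplicates first, then split, so no comment can appear in both train and test. The column names (`comment`, `label`) and the toy data are assumptions for illustration, not from the original post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real 100k-comment dataset (hypothetical).
df = pd.DataFrame({
    "comment": ["late delivery", "late delivery", "billing error",
                "rude agent", "rude agent", "rude agent", "app crash"],
    "label":   ["shipping", "shipping", "billing",
                "service", "service", "service", "technical"],
})

# Deduplicate BEFORE splitting so evaluation is on unseen comments only.
deduped = df.drop_duplicates(subset=["comment", "label"]).reset_index(drop=True)

train_df, test_df = train_test_split(
    deduped, test_size=0.25, random_state=42, shuffle=True
)

# Sanity check: no comment should appear in both splits.
overlap = set(train_df["comment"]) & set(test_df["comment"])
print(len(deduped), len(overlap))  # 4 unique rows, 0 overlap
```

If you do want to keep duplicates in training (e.g. to reflect real class frequencies), a group-aware split such as scikit-learn's `GroupShuffleSplit`, keyed on the comment text, keeps all copies of a comment on the same side of the split.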


thank you so much @dhruvmetha !

I will try your approach. Thanks again!