BERT model with duplicated data and F1 score

Hello guys, good evening.

I have a question about data duplication when training a BERT model. I have a dataset of 100k comments with 50 unique labels. Some comments are duplicated up to 4 times, i.e. the same comment appears 4 times with the same label. Should I train my model using only unique comments? I ask because when I train on unique comments my weighted F1 score is 79, but when I use the data with duplicates the F1 score jumps to 93. So should I remove the duplicates?

My guess is that the model is learning more from the duplicates, since it sees those comments and their associated labels more often.

Another question… my comments are around 300 words each and relate to customer complaints. Does it make sense for me to use bert-base-uncased, or would moving to bert-large help in any way?

Your feedback / suggestions are appreciated.



Having duplicates in the training data should be fine, but they should not leak into the test set.

Keeping everything unique would be ideal. The jump from 79 to 93 looks like a data leak: if you split after shuffling, copies of the same comment can land in both train and test, so the model is evaluated on examples it has already memorized.
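A minimal sketch of the safe workflow: drop exact duplicates first, then split, so no comment can appear in both train and test. The column names (`comment`, `label`) and the toy data are assumptions for illustration, not from the original post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real 100k-comment dataset (hypothetical).
df = pd.DataFrame({
    "comment": ["late delivery", "late delivery", "billing error",
                "rude agent", "rude agent", "rude agent", "app crash"],
    "label":   ["shipping", "shipping", "billing",
                "service", "service", "service", "technical"],
})

# Deduplicate BEFORE splitting so evaluation is on unseen comments only.
deduped = df.drop_duplicates(subset=["comment", "label"]).reset_index(drop=True)

train_df, test_df = train_test_split(
    deduped, test_size=0.25, random_state=42, shuffle=True
)

# Sanity check: no comment should appear in both splits.
overlap = set(train_df["comment"]) & set(test_df["comment"])
print(len(deduped), len(overlap))  # 4 unique rows, 0 overlap
```

If you do want to keep duplicates in training (e.g. to reflect real class frequencies), a group-aware split such as scikit-learn's `GroupShuffleSplit`, keyed on the comment text, keeps all copies of a comment on the same side of the split.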


thank you so much @dhruvmetha !

I will try your approach. Thanks again!