Hello everyone, good evening.
I have a question about data duplication when training a BERT model. I have a dataset of 100k comments with 50 unique labels. Some comments are duplicated up to 4 times, i.e. the same comment appears 4 times with the same label. Should I train my model on unique comments only? I ask because when I train on unique comments my weighted F1 score is 79, but when I use the data with duplicates the F1 score jumps to 93. So should I remove the duplicates?
My guess is that the model learns more with duplicates because it sees those comments and their associated labels more often.
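To make the comparison concrete, here is a minimal sketch of what "training on unique comments" could look like, plus a check for how many evaluation comments also appear in the training data. It uses only the Python standard library. The function names (`normalize`, `drop_duplicates`, `count_leaked`), the toy rows, and the lowercase/whitespace normalization are all assumptions for illustration, not the actual pipeline behind the numbers above.

```python
def normalize(text):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())

def drop_duplicates(rows):
    """Keep the first occurrence of each (comment, label) pair."""
    seen = set()
    unique = []
    for comment, label in rows:
        key = (normalize(comment), label)
        if key not in seen:
            seen.add(key)
            unique.append((comment, label))
    return unique

def count_leaked(train_rows, test_rows):
    """Count test rows whose comment text also appears in the training rows."""
    train_texts = {normalize(comment) for comment, _ in train_rows}
    return sum(1 for comment, _ in test_rows if normalize(comment) in train_texts)

# Toy data (hypothetical): one complaint repeated three times, one unique.
rows = [
    ("Card was charged twice", "billing"),
    ("card was charged  twice", "billing"),
    ("App crashes on login", "bug"),
    ("Card was charged twice", "billing"),
]

unique_rows = drop_duplicates(rows)
print(len(rows), len(unique_rows))

# If the split is made before deduplicating, copies can land on both sides.
train_rows, test_rows = rows[:1], rows[1:]
print(count_leaked(train_rows, test_rows))
```

If `count_leaked` returns a large number on the real train/test split, part of the jump from 79 to 93 may come from the model being evaluated on comments it already saw during training, which would be worth ruling out before comparing the two scores.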
Another question: my comments are around 300 words long and relate to customer complaints. Does it make sense for me to use bert-base-uncased, or would moving to BERT-large help in any way?
Your feedback and suggestions are appreciated.
Thanks