Hello Hugging Face community,
I’m taking part in a Kaggle competition on detecting personally identifiable information (PII) with a model (The Learning Agency Lab - PII Data Detection | Kaggle). The original dataset is small, so we augmented it with data generated by large language models, which gave us several generated datasets.
I have observed the following: training the DeBERTa model on the original dataset plus one generated dataset improves the leaderboard score, but training on the original dataset plus two or more generated datasets lowers it (The Learning Agency Lab - PII Data Detection | Kaggle). What could cause this, and how would you suggest investigating it?
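As one possible starting point, here is a minimal sketch that compares the token-label distribution of each generated dataset against the original. It assumes each document is a dict with a `labels` list of BIO tags (as in the competition's JSON format). The helper names and the toy data are hypothetical and only for illustration. A large divergence would only hint that a generated dataset shifts the label balance; it does not prove that this is the cause of the score drop.

```python
from collections import Counter
from math import log2


def label_distribution(docs):
    """Relative frequency of each token label across all documents."""
    counts = Counter(label for doc in docs for label in doc["labels"])
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) between two distributions."""
    labels = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2
    m = {l: 0.5 * (p.get(l, 0.0) + q.get(l, 0.0)) for l in labels}

    def kl(a):
        # KL(a || m), skipping zero-probability labels
        return sum(a[l] * log2(a[l] / m[l]) for l in a if a[l] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy stand-ins for the real datasets (hypothetical data, not competition data).
original = [{"labels": ["O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]}]
generated = {
    "generated_a": [{"labels": ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O"]}],
    "generated_b": [{"labels": ["B-EMAIL", "B-EMAIL", "O", "B-EMAIL", "B-EMAIL"]}],
}

reference = label_distribution(original)
for name, docs in generated.items():
    divergence = js_divergence(reference, label_distribution(docs))
    print(f"{name}: JS divergence from original = {divergence:.3f}")
```

The same loop could be extended to other statistics, such as document length or the share of exact-duplicate texts between generated datasets, and paired with an ablation that trains on each combination of datasets using a fixed validation split taken from the original data only.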
Thank you for any guidance!