Questions when using multiple datasets to finetune Deberta

Hello Huggingface community,

I’m participating in a Kaggle competition on detecting personally identifiable information (The Learning Agency Lab - PII Data Detection | Kaggle). Because the original dataset is small, we augmented it with data generated by large language models, which leaves us with several datasets to train on.

I’ve observed that training DeBERTa on the original dataset plus one generated dataset improves the leaderboard score, yet combining the original dataset with two or more generated datasets reduces it. I’m seeking insight into potential causes and strategies for investigating this.
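One investigation strategy is to control how much generated data enters each training run, so you can see whether the drop comes from the sheer volume of synthetic examples rather than from any single dataset. Below is a minimal sketch, assuming each dataset is a plain Python list of examples; `mix_datasets` and its `ratio` parameter are hypothetical names, not part of any library.

```python
import random

def mix_datasets(original, generated_sets, ratio=0.5, seed=0):
    """Combine the original examples with a sampled fraction of each
    generated dataset, so synthetic data cannot dominate training.

    `ratio` is the number of generated examples to keep per dataset,
    expressed as a fraction of the original dataset's size.
    """
    rng = random.Random(seed)
    mixed = list(original)
    n_keep = int(len(original) * ratio)
    for gen in generated_sets:
        # Sample without replacement; keep the whole set if it is small.
        mixed.extend(rng.sample(gen, min(n_keep, len(gen))))
    rng.shuffle(mixed)
    return mixed
```

Sweeping `ratio` (e.g. 0.25, 0.5, 1.0) while holding everything else fixed would show whether the score degrades gradually with more synthetic data, which would point at distribution shift between the generated datasets and the competition data rather than at one bad dataset.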

Thank you for any guidance!