How can I tell what each dataset was used for?

Fadi12 · June 30, 2025, 10:42am

Hello,
In many model cards, there’s a list of datasets — sometimes including several different ones. How can I determine which datasets were used for training, fine-tuning, or evaluation when it’s not explicitly specified?

For example, in the model card for sileod/deberta-v3-base-tasksource-nli, many datasets are listed. How can I find out which specific ones were actually used for training?

Thanks!

John6666 · June 30, 2025, 10:51am

Dataset tags are sometimes added automatically by the trainer, but in general they are optional fields that the model author fills in. The tags alone do not say how each dataset was used, and there is no established method for obtaining further details.

However, when a model accompanies an academic paper, details such as the actual training code may be available on GitHub or in the paper linked from the model card. That appears to be the case for this model.

Additionally, you can directly contact the author through the community section of each model.
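
If it helps, here is a minimal sketch for pulling the declared dataset tags out of a model's Hub metadata programmatically. It assumes the `huggingface_hub` library is installed and that dataset tags appear in the model's tag list as strings of the form `dataset:<id>`; the helper names `declared_datasets` and `fetch_declared_datasets` are made up for this example. Note that it only lists what the author declared, not whether each dataset was used for training, fine-tuning, or evaluation.

```python
from typing import Iterable, List


def declared_datasets(tags: Iterable[str]) -> List[str]:
    """Return dataset IDs from tags of the assumed form 'dataset:<id>'."""
    prefix = "dataset:"
    return [tag[len(prefix):] for tag in tags if tag.startswith(prefix)]


def fetch_declared_datasets(repo_id: str) -> List[str]:
    """Fetch a model's tags from the Hub and keep only the dataset tags.

    Hypothetical helper: needs network access and `huggingface_hub`.
    """
    # Imported lazily so declared_datasets() works without the library.
    from huggingface_hub import HfApi

    info = HfApi().model_info(repo_id)
    return declared_datasets(info.tags or [])
```

Calling `fetch_declared_datasets("sileod/deberta-v3-base-tasksource-nli")` should return the same list shown on the model page. To learn what each dataset was used for, you would still need the paper, the training code, or the author.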
