How does Hugging Face Hub jointly version models and their training data?

Hi,

I see HF Models are versioned, and HF Datasets are versioned too.
How do I know which version of a dataset went into training a given version of a model? (I'm interested in the answer for both public and private hubs.)

Hi Olivier,

In the model metadata spec, you'll see we suggest encoding the version (= git revision) of the dataset inside the model-index dict, i.e. the place where you encode your eval results in a structured way:
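As a rough illustration of that suggestion, here is a minimal sketch that builds a model-index style structure as a plain Python dict, with the dataset's git revision stored next to the eval results. The model name, dataset id, commit hash, and metric value are all hypothetical placeholders, and the field layout is an assumption based on the model card metadata spec, so check it against the current spec before relying on it.

```python
import json


def build_model_index(model_name, task_type, dataset_id, dataset_revision,
                      metric_type, metric_value):
    """Return a model-index style list that records the dataset git revision."""
    return [
        {
            "name": model_name,
            "results": [
                {
                    "task": {"type": task_type},
                    "dataset": {
                        "type": dataset_id,
                        "name": dataset_id,
                        # The git commit hash of the dataset repo used here.
                        "revision": dataset_revision,
                    },
                    "metrics": [{"type": metric_type, "value": metric_value}],
                }
            ],
        }
    ]


# All values below are hypothetical placeholders, not real repos or results.
metadata = {
    "model-index": build_model_index(
        model_name="my-model",
        task_type="text-classification",
        dataset_id="my-org/my-dataset",
        dataset_revision="0123456789abcdef0123456789abcdef01234567",
        metric_type="accuracy",
        metric_value=0.9,
    )
}

print(json.dumps(metadata, indent=2))
```

The printed structure is what would end up, serialized as YAML, in the metadata header of the model card.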

Models that are pushed by the HF Trainer should include that metadata most of the time.

cc @sgugger @osanseviero

→ Does the HF Trainer automatically copy the git hash of the training dataset (if it detects it’s an HF Dataset) to the model metadata?

thanks

We don't, as I'm not sure we have that information in the dataset metadata (maybe it was added recently, in which case we can make a PR to add support for this).

Yes, I think that should be possible, cc @lhoestq

Yup, that will definitely be possible. There's an issue on this here, but we haven't started working on it yet: "dataset metadata for reproducibility" (Issue #4129, huggingface/datasets on GitHub). Contributions are welcome ^^
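Until something like that is built in, one possible manual workaround is to resolve the dataset's commit hash yourself, load the dataset pinned to that hash, and record the hash in the model metadata. The sketch below assumes the `huggingface_hub` and `datasets` packages are installed. The helper names `resolve_dataset_revision` and `dataset_entry` and the repo id `my-org/my-dataset` are hypothetical, and this is not what the Trainer does today.

```python
def resolve_dataset_revision(repo_id, token=None):
    """Look up the current commit hash of a dataset repo on the Hub.

    This makes a network call. Pass a token for private repos.
    """
    # Imported lazily so the rest of this module works without the package.
    from huggingface_hub import HfApi

    info = HfApi(token=token).dataset_info(repo_id)
    return info.sha


def dataset_entry(repo_id, revision):
    """Build the `dataset` part of a model-index result, pinned to a revision."""
    if not revision:
        raise ValueError("a dataset revision (git commit hash) is required")
    return {"type": repo_id, "name": repo_id, "revision": revision}


# Usage sketch (hypothetical repo id, needs network access):
#
#   import datasets
#   sha = resolve_dataset_revision("my-org/my-dataset")
#   train_data = datasets.load_dataset("my-org/my-dataset", revision=sha)
#   entry = dataset_entry("my-org/my-dataset", sha)
```

Resolving the hash once and passing it as `revision` when loading means the data you train on and the hash you record cannot drift apart if the dataset repo is updated in between.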