How does Hugging Face Hub jointly version models and their training data?

OlivierCR · January 6, 2023, 10:45am

Hi,

I see HF Models are versioned, and HF Datasets are versioned too.
How do I know which version of a dataset went into training a given version of a model? (I’m interested in the answer for both public and private hubs.)

julien-c · January 8, 2023, 2:49pm

Hi Olivier,

In the model metadata spec, you’ll see we suggest encoding the version (= git revision) of the dataset inside the model-index dict, i.e. the place where you encode your eval results in a structured way:

https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1#L31

          - name: {model_id}
            results:
            - task:
                type: {task_type}             # Required. Example: automatic-speech-recognition
                name: {task_name}             # Optional. Example: Speech Recognition
              dataset:
                type: {dataset_type}          # Required. Example: common_voice. Use dataset id from https://hf.co/datasets
                name: {dataset_name}          # Required. A pretty name for the dataset. Example: Common Voice (French)
                config: {dataset_config}      # Optional. The name of the dataset configuration used in `load_dataset()`. Example: fr in `load_dataset("common_voice", "fr")`. See the `datasets` docs for more info: https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset.name
                split: {dataset_split}        # Optional. Example: test
                revision: {dataset_revision}  # Optional. Example: 5503434ddd753f426f4b38109466949a1217c2bb
                args:
                  {arg_0}: {value_0}          # Optional. Additional arguments to `load_dataset()`. Example for wikipedia: language: en
                  {arg_1}: {value_1}          # Optional. Example for wikipedia: date: 20220301
              metrics:
                - type: {metric_type}         # Required. Example: wer. Use metric id from https://hf.co/metrics
                  value: {metric_value}       # Required. Example: 20.90
                  name: {metric_name}         # Optional. Example: Test WER
                  config: {metric_config}     # Optional. The name of the metric configuration used in `load_metric()`. Example: bleurt-large-512 in `load_metric("bleurt", "bleurt-large-512")`. See the `datasets` docs for more info: https://huggingface.co/docs/datasets/v2.1.0/en/loading#load-configurations
                  args:
                    {arg_0}: {value_0}        # Optional. The arguments passed during `Metric.compute()`. Example for `bleu`: max_order: 4
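For illustration, here is a minimal sketch of how the `revision` field from the spec above could be filled in by hand before pushing a model. It uses only the Python standard library; the helper `build_model_index` and the model name `my-asr-model` are hypothetical, and the other values are simply the examples quoted in the spec comments. It is not what the Trainer does, just one way to produce a `model-index` entry that pins the dataset commit.

```python
import json


def build_model_index(model_id, task_type, dataset_type, dataset_name,
                      dataset_revision, metric_type, metric_value,
                      dataset_split=None):
    """Build a model-index list that records the dataset git revision.

    Hypothetical helper: it only mirrors the field names of the spec above.
    """
    dataset = {
        "type": dataset_type,
        "name": dataset_name,
        # Git commit hash of the dataset repo that was actually used.
        "revision": dataset_revision,
    }
    if dataset_split is not None:
        dataset["split"] = dataset_split
    return [{
        "name": model_id,
        "results": [{
            "task": {"type": task_type},
            "dataset": dataset,
            "metrics": [{"type": metric_type, "value": metric_value}],
        }],
    }]


model_index = build_model_index(
    model_id="my-asr-model",  # hypothetical model name
    task_type="automatic-speech-recognition",
    dataset_type="common_voice",
    dataset_name="Common Voice (French)",
    dataset_revision="5503434ddd753f426f4b38109466949a1217c2bb",
    metric_type="wer",
    metric_value=20.90,
    dataset_split="test",
)

# Print the structure; it would be serialized to YAML in the model card metadata.
print(json.dumps({"model-index": model_index}, indent=2))
```

The resulting dict has the same shape as the YAML above, so it could be written into the metadata block of the model card with whatever YAML tooling you already use.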

Models that are pushed by the HF Trainer should include that metadata most of the time.

cc @sgugger @osanseviero

OlivierCR · January 9, 2023, 10:29am

→ Does the HF Trainer automatically copy the git hash of the training dataset (if it detects it’s an HF Dataset) to the model metadata?

thanks

sgugger · January 9, 2023, 5:47pm

We don’t, as I’m not sure we have that information in the dataset metadata (maybe it was added recently, in which case we can make a PR to add support for this).

julien-c · January 9, 2023, 6:10pm

yes i think that should be possible, cc @lhoestq

lhoestq · January 13, 2023, 4:11pm

Yup, that will definitely be possible. There’s an issue on this (“dataset metadata for reproducibility”, Issue #4129 in huggingface/datasets on GitHub), but we haven’t started working on it yet. Contributions are welcome ^^
