How exactly does datasets versioning work?

I just created my own dataset. I followed the guide and uploaded the necessary files. Then I found some duplicates, so I updated the data files (and pushed them to the remote git repo) and bumped the Version in my data loading script. I then thought I might as well run the datasets-cli test script again to be sure, and to my surprise the script did not download the new version of the dataset but used the previously cached data files.

So I am a bit confused about versioning in datasets and how to use it as a creator: how can I specify that some committed data is new and different from a previous version, so that a user will automatically download this new version rather than use a previously cached one? Git tagging? You can find my dataset repo structure here.

By default, data is downloaded from the main branch. What you experienced may be an issue with datasets-cli test. What command did you run?

I ran the required command:

datasets-cli test .\datasets\hebban-reviews\ --save_infos --all_configs  

which re-used the cache and did not download the newly committed (and pushed) data files. Does the cache also compare commits, or does it just check whether it is the same branch?

Your dataset script uses this base URL to download the files: https://huggingface.co/datasets/BramVanroy/hebban-reviews/resolve/main/. I guess the cache didn’t check the ETag to decide whether to re-download the file (for reference, use_etag=False is set here in the source code)
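For intuition, here is a simplified, hypothetical sketch of what use_etag=False implies (function name and hashing scheme are made up, not the actual datasets code): with the ETag ignored, the cache key depends on the URL alone, so a file re-uploaded at the same .../resolve/main/... URL keeps resolving to the stale cached copy.

```python
import hashlib

def cache_filename(url, etag=None, use_etag=True):
    # Simplified sketch: derive a cache filename from the URL,
    # optionally mixing in the server-provided ETag
    key = url
    if use_etag and etag is not None:
        key += "." + etag
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

url = "https://huggingface.co/datasets/BramVanroy/hebban-reviews/resolve/main/train.jsonl.gz"

# With use_etag=False, a re-uploaded file (new ETag) maps to the same cache entry:
assert cache_filename(url, etag="old", use_etag=False) == cache_filename(url, etag="new", use_etag=False)
# With the ETag included, the cache key would change when the file changes:
assert cache_filename(url, etag="old") != cache_filename(url, etag="new")
```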

Can you try passing a relative path to dl_manager.download_and_extract instead of the full URL?

files = dl_manager.download_and_extract({
    "train": "train.jsonl.gz",
    "test": "test.jsonl.gz"
})

this way it will take the commit hash into account when caching the files.
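To illustrate why this matters, here is a simplified, hypothetical sketch (function names and commit hashes are made up, not the actual datasets internals): a relative path is resolved against the repository at a specific commit, so the resolved URL, and therefore the cache key, changes whenever the data files change.

```python
import hashlib

def resolve(repo, revision, relative_path):
    # Hypothetical resolution step: a relative path is expanded against
    # the repository at a specific revision (commit hash, branch, or tag)
    return f"https://huggingface.co/datasets/{repo}/resolve/{revision}/{relative_path}"

def cache_key(url):
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

# Made-up commit hashes for illustration:
before = resolve("BramVanroy/hebban-reviews", "aaaa111", "train.jsonl.gz")
after = resolve("BramVanroy/hebban-reviews", "bbbb222", "train.jsonl.gz")

# A new commit yields a new resolved URL, hence a new cache entry:
assert cache_key(before) != cache_key(after)
```

A hard-coded .../resolve/main/ URL, by contrast, stays identical across commits, so the old cached copy keeps matching.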

That seems to work! Relatedly, with respect to the discussion on Slack: am I right in assuming that (copy-paste) you can indeed use semantic versioning on your dataset through git tags? If I understand correctly, it should go like this:

  • dataset info in the data loading script contains the version so that this metadata can be retrieved from Python code. This version is also used when saving the dataset to disk
  • data files are versioned by their git commit/tag → when given, a revision will be looked up in the repository and the corresponding files downloaded from that tag/commit
  • → you’ll have to make sure that the Version specified in your loading script corresponds to your git tag (if you want to use semantic versioning)

Is that correct?
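If that summary is right, the release workflow could be sketched like this (the commit message and tag name are illustrative; this only simulates tagging in a scratch repo):

```shell
# Illustrative only: simulate tagging a dataset release in a throwaway repo
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "Bump Version to 1.1.0"
git tag 1.1.0
git tag -l   # lists: 1.1.0
```

A user could then pin that release, e.g. load_dataset("BramVanroy/hebban-reviews", revision="1.1.0").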

If you want to use semantic versioning, this is indeed a good approach :slight_smile:

Datasets don’t always live in a git repository; that’s why versioning is also included in the Python code, and it is taken into account when caching the dataset.
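As a rough illustration of that last point (the directory layout below is simplified, not the exact cache structure): the version string from the loading script becomes part of the path the prepared dataset is cached under, so bumping the Version points at a fresh location.

```python
from pathlib import Path

def prepared_cache_dir(root, name, config, version):
    # Simplified sketch: prepared datasets are cached under a
    # version-specific directory, e.g. <root>/<name>/<config>/<version>
    return Path(root) / name / config / version

old = prepared_cache_dir("hf-cache", "hebban-reviews", "default", "1.0.0")
new = prepared_cache_dir("hf-cache", "hebban-reviews", "default", "1.1.0")

# Bumping the Version in the loading script yields a different cache directory,
# so the dataset is rebuilt instead of silently reusing the old one:
assert old != new
```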