How exactly does datasets versioning work?

I just created my own dataset. I followed the guide and uploaded the necessary files. Then I found some duplicates, so I updated the data files (and pushed them to the remote git repo) and bumped the Version in my data loading script. I then thought I might as well run the datasets-cli test script again to be sure, and to my surprise the script did not download the new version of the dataset but used the previously cached data files.

So I am a bit confused about versioning in datasets and how to use it as a creator: how can I specify that some committed data is new and different from a previous version, so that a user will automatically download this new version rather than use a previously cached one? Git tagging? You can find my dataset repo structure here.

By default, data is downloaded from the main branch. What you experienced may be an issue with datasets-cli test. What command did you run?

I ran the required command:

datasets-cli test .\datasets\hebban-reviews\ --save_infos --all_configs  

which re-used the cache and did not download the newly committed (and pushed) data files. Does the cache also compare commits, or does it just check whether it is the same branch?

Your dataset script uses this base URL to download the files: https://huggingface.co/datasets/BramVanroy/hebban-reviews/resolve/main/. I guess the cache didn’t check the ETag to decide whether to re-download the file (for reference, use_etag=False is set here in the source code)
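For intuition, here is a simplified, hypothetical sketch of what use_etag=False implies (function name and hashing scheme are made up, not the actual datasets code): with the ETag ignored, the cache key depends on the URL alone, so a file re-uploaded at the same .../resolve/main/... URL keeps resolving to the stale cached copy.

```python
import hashlib

def cache_filename(url, etag=None, use_etag=True):
    # Simplified sketch: derive a cache filename from the URL,
    # optionally mixing in the server-provided ETag
    key = url
    if use_etag and etag is not None:
        key += "." + etag
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

url = "https://huggingface.co/datasets/BramVanroy/hebban-reviews/resolve/main/train.jsonl.gz"

# With use_etag=False, a re-uploaded file (new ETag) maps to the same cache entry:
assert cache_filename(url, etag="old", use_etag=False) == cache_filename(url, etag="new", use_etag=False)
# With the ETag included, the cache key would change when the file changes:
assert cache_filename(url, etag="old") != cache_filename(url, etag="new")
```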

Can you try passing a relative path to dl_manager.download_and_extract instead of the full URL?

files = dl_manager.download_and_extract({
    "train": "train.jsonl.gz",
    "test": "test.jsonl.gz"
})

this way it will take the commit hash into account when caching the files.
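To illustrate why this matters, here is a simplified, hypothetical sketch (function names and commit hashes are made up, not the actual datasets internals): a relative path is resolved against the repository at a specific commit, so the resolved URL, and therefore the cache key, changes whenever the data files change.

```python
import hashlib

def resolve(repo, revision, relative_path):
    # Hypothetical resolution step: a relative path is expanded against
    # the repository at a specific revision (commit hash, branch, or tag)
    return f"https://huggingface.co/datasets/{repo}/resolve/{revision}/{relative_path}"

def cache_key(url):
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

# Made-up commit hashes for illustration:
before = resolve("BramVanroy/hebban-reviews", "aaaa111", "train.jsonl.gz")
after = resolve("BramVanroy/hebban-reviews", "bbbb222", "train.jsonl.gz")

# A new commit yields a new resolved URL, hence a new cache entry:
assert cache_key(before) != cache_key(after)
```

A hard-coded .../resolve/main/ URL, by contrast, stays identical across commits, so the old cached copy keeps matching.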

That seems to work! Relatedly, with respect to the discussion on Slack: am I right in assuming that (copy-paste) you can indeed use semantic versioning on your dataset through git tags? If I understand correctly, it should go like this:

  • dataset info in the data loading script contains the version so that this metadata can be retrieved from Python code. This version is also used when saving the dataset to disk
  • data files are versioned by their git commit/tag → when given, a revision will be looked up in the repository and the corresponding files downloaded from that tag/commit
  • → you’ll have to make sure that the Version specified in your loading script corresponds to your git tag (if you want to use semantic versioning)

Is that correct?
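If that summary is right, the release workflow could be sketched like this (the commit message and tag name are illustrative; this only simulates tagging in a scratch repo):

```shell
# Illustrative only: simulate tagging a dataset release in a throwaway repo
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "Bump Version to 1.1.0"
git tag 1.1.0
git tag -l   # lists: 1.1.0
```

A user could then pin that release, e.g. load_dataset("BramVanroy/hebban-reviews", revision="1.1.0").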

If you want to use semantic versioning, this is indeed a good approach :slight_smile:

Datasets don’t always live in a git repository; that’s why versioning is also included in the Python code, and it is taken into account when caching the dataset.
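As a rough illustration of that last point (the directory layout below is simplified, not the exact cache structure): the version string from the loading script becomes part of the path the prepared dataset is cached under, so bumping the Version points at a fresh location.

```python
from pathlib import Path

def prepared_cache_dir(root, name, config, version):
    # Simplified sketch: prepared datasets are cached under a
    # version-specific directory, e.g. <root>/<name>/<config>/<version>
    return Path(root) / name / config / version

old = prepared_cache_dir("hf-cache", "hebban-reviews", "default", "1.0.0")
new = prepared_cache_dir("hf-cache", "hebban-reviews", "default", "1.1.0")

# Bumping the Version in the loading script yields a different cache directory,
# so the dataset is rebuilt instead of silently reusing the old one:
assert old != new
```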