How do I get the dataset loader working with multiple versions?


Our team has created an original version of a dataset on Hugging Face and a loader for it. The dataset is downloaded from a specific URL. We are now creating a new version of the dataset, which will have a new URL. We want users to be able to access either version, depending on their choice. How can I update the loader to handle both URLs and switch between them based on the version? I’m pretty sure this is a solved problem, so I wonder if there is example code for this?


Hi @cperiz! You can create a custom config class inherited from datasets.BuilderConfig and add a custom parameter for a URL, for example data_url, to it. Then you’ll be able to refer to it inside the _split_generators function as self.config.data_url.

This is how it’s done in SuperGlue. You can also read a more detailed guide in the docs. :slight_smile:


Thanks @polinaeterna for your help! This was helpful. So I’ve configured the code this way. I have two sets of datasets.BuilderConfigs, one with version 1.0.0 and one with version 1.0.1. I want users to have the option of downloading either one. Is there an argument I can pass to datasets.load_dataset to pick the version a user wants? Right now the only way to pick a different BuilderConfig seems to be via its name, but I would like the loader to pick a BuilderConfig based on both name and version. Do you know how to do this? Thanks!

@cperiz Similarly, you can add a parameter corresponding to the version to your custom config, say, dataset_version. Note that it should not be named “version”, because “version” is a standard parameter of the library’s BuilderConfig and is used differently.
So you’ll have something like this:

import datasets

class YourCustomConfig(datasets.BuilderConfig):
    def __init__(self, *args, dataset_version=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Fall back to the first version when no version is requested.
        self.dataset_version = dataset_version if dataset_version else "first_version"
        # Compare against the resolved value so that None also maps to the first URL.
        self.data_url = "https://first/data/url" if self.dataset_version == "first_version" else "https://second/data/url"

Then you can load both versions like:

ds_v1 = load_dataset("your_namespace/your_dataset")   # default version when dataset_version=None is "first_version"
ds_v2 = load_dataset("your_namespace/your_dataset", dataset_version="second_version")

Alternatively, a clearer way is probably to define two builder configs with different names, one per version, for example "v1" and "v2", so that you can load them like:
load_dataset("your_namespace/your_dataset", "v1").
Your builder class should have these configs in BUILDER_CONFIGS class variable:

BUILDER_CONFIGS = [YourCustomConfig(name="v1"), YourCustomConfig(name="v2")]

and then, depending on the config name (which is either “v1” or “v2”), you choose data_url inside your script.
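Choosing the URL by config name could be sketched as a plain dictionary lookup. This is only an illustration: the _DATA_URLS mapping and the resolve_data_url helper are hypothetical names, and the URLs are the placeholders used earlier in this thread.

```python
# Hypothetical mapping from config name to download URL. The URLs are the
# placeholders from this thread, not real endpoints.
_DATA_URLS = {
    "v1": "https://first/data/url",
    "v2": "https://second/data/url",
}


def resolve_data_url(config_name: str) -> str:
    """Return the download URL for a config name, failing loudly on typos."""
    try:
        return _DATA_URLS[config_name]
    except KeyError:
        known = ", ".join(sorted(_DATA_URLS))
        raise ValueError(
            f"Unknown config {config_name!r}; expected one of: {known}"
        ) from None
```

Inside _split_generators you would then call something like resolve_data_url(self.config.name) and hand the result to the download manager.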

Tell me if it helps or you have more questions :slight_smile:

@polinaeterna thanks, this certainly helps, but I also have another question. Our dataset has multiple subsets, and we currently offer users the ability to download one subset at a time or the full dataset, using different BuilderConfigs with different names (let’s assume the names are like subset1, subset2, all).

Now for two versions of the full dataset, as a user I would ideally be able to do something like

ds_v1 = load_dataset("your_namespace/your_dataset", name="subset1")   # default version when dataset_version=None is "first_version"
ds_v2 = load_dataset("your_namespace/your_dataset", name="subset1", dataset_version="second_version")

But this does not seem possible as BuilderConfigs have to have unique names. So my less elegant solution has been to name the BuilderConfigs of the second version differently.

ds_v1 = load_dataset("your_namespace/your_dataset", name="subset1")   # default version when dataset_version=None is "first_version"
ds_v2 = load_dataset("your_namespace/your_dataset", name="subset1_second_version")

Is there a more elegant solution than this, by any chance?

Also, I don’t seem to be able to pull up a specific BuilderConfig simply by setting dataset_version="second_version". The loader only seems to respond to name, as passed through load_dataset. Thanks again for your help!
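For reference, the naming workaround described above (one uniquely named config per subset and version) can at least be generated rather than written by hand. The sketch below is not an official datasets feature; the subset names, version labels, and helper functions are all assumptions made for illustration.

```python
from itertools import product
from typing import Tuple

SUBSETS = ["subset1", "subset2", "all"]         # assumed subset names
VERSIONS = ["first_version", "second_version"]  # version labels used in this thread
DEFAULT_VERSION = "first_version"


def config_name(subset: str, version: str) -> str:
    """Keep the bare subset name for the default version, suffix the others."""
    return subset if version == DEFAULT_VERSION else f"{subset}_{version}"


def parse_config_name(name: str) -> Tuple[str, str]:
    """Invert config_name: recover (subset, version) from a config name."""
    for version in VERSIONS:
        suffix = f"_{version}"
        if version != DEFAULT_VERSION and name.endswith(suffix):
            return name[: -len(suffix)], version
    return name, DEFAULT_VERSION


# One (name, subset, version) entry per combination. Each entry would become
# one BuilderConfig in the loader's BUILDER_CONFIGS list.
CONFIG_SPECS = [(config_name(s, v), s, v) for s, v in product(SUBSETS, VERSIONS)]
```

With this scheme, existing users keep loading "subset1" unchanged, while "subset1_second_version" selects the new data. The names stay unique because only non-default versions get a suffix.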