How to upload multiple related datasets?


I have several related datasets. For example: 1) users, 2) products, 3) product views made by users.

Should I create a separate repository for each dataset? Or is it possible to upload them all into a single repository?

Pushing the datasets into the same repository one after another just overwrites them: the user_products push replaces both users and products.


Can I specify a separate subpath or name for each dataset?

The super_glue repository on the Hugging Face Hub contains subsets. I guess that's what I need. But it's based on a dataset loading script that downloads data from external sources. Is there a simpler example of a dataset with subsets, based on plain Datasets, without downloading data from external sources and so on?

It seems that the only way to add subsets is to create a dataset loading script. I’ve created one.

Here is the script; maybe it will be useful for someone:

import datasets
import pyarrow.parquet as pq

# Base URL of the data files; left empty here, point it at your repository
_BASE_URL = ''

# One Features schema per subset (config)
_FEATURES = {
    'domains': datasets.Features({
        'reg_number': datasets.Value('string'),
        'standard_name': datasets.Value('string'),
        'name': datasets.Value('string'),
        'purpose': datasets.Value('string'),
        'embeddings': datasets.Sequence(datasets.Value('float32')),
    }),
    'generalized_functions': datasets.Features({
        'generalized_function_id': datasets.Value('string'),
        'reg_number': datasets.Value('string'),
        'name': datasets.Value('string'),
        'embeddings': datasets.Sequence(datasets.Value('float32')),
    }),
}


class ProfStandardsDatasetBuilder(datasets.ArrowBasedBuilder):

    VERSION = datasets.Version('0.0.1')

    # One BuilderConfig per subset
    BUILDER_CONFIGS = [
        datasets.BuilderConfig('domains', VERSION),
        datasets.BuilderConfig('generalized_functions', VERSION),
    ]

    DEFAULT_CONFIG_NAME = 'domains'

    def _info(self):
        return datasets.DatasetInfo(features=_FEATURES[self.config.name])

    def _split_generators(self, dl_manager):
        # Each subset is stored in a parquet file named after its config
        url = _BASE_URL + self.config.name + '.parquet'
        file_path = dl_manager.download_and_extract(url)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={'file_path': file_path},
            ),
        ]

    def _generate_tables(self, file_path):
        yield 0, pq.read_table(file_path)

Just store your datasets in parquet format:


And upload it manually to your repository.

It was not very hard, but I got lost in the documentation :slight_smile:

Some things to note:

  1. The dataset loading script should have the same name as the repository. For example, a repository named prof_standards_sbert_large_mt_nlu_ru should contain prof_standards_sbert_large_mt_nlu_ru.py.

  2. It's better to use underscores in dataset names. There are dash-named repositories on the Hub, and they seem to work, but I had problems with dashes.

  3. You can use the following command to test your script locally:

datasets-cli test path/to/your_script.py --save_infos --all_configs

The --save_infos flag also generates the dataset info. However, I don't know whether dataset_info is required.

  4. You can use the following command to test your repository:

datasets-cli test 'AresEkb/prof_standards_sbert_large_mt_nlu_ru' --save_infos --all_configs

  5. The dataset card isn't updated immediately.

Thanks for the notes!

We’re also working on a way to define which files go into which dataset configuration using simple YAML here: Support for multiple configs in packaged modules via metadata yaml info by polinaeterna · Pull Request #5331 · huggingface/datasets · GitHub
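As a sketch of what that could look like for the dataset above (the exact YAML schema was still under discussion in that PR, so the field names here are illustrative):

```yaml
configs:
- config_name: domains
  data_files: domains.parquet
- config_name: generalized_functions
  data_files: generalized_functions.parquet
```

This would go into the YAML metadata block at the top of the repository's README.md, removing the need for a loading script.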
