How to upload multiple related datasets?

Hi

I have several related datasets. For example: 1) users, 2) products, 3) product views made by users.

Should I create a separate repository for each dataset, or is it possible to upload them into a single repository?

The following code overwrites the users and products datasets with the user_products dataset:

users_dataset.push_to_hub('my/repo')
products_dataset.push_to_hub('my/repo')
user_products_dataset.push_to_hub('my/repo')

Can I specify a separate subpath or name for each dataset?

The super_glue repository on the Hugging Face Hub contains subsets, and I guess that’s what I need. But it’s based on a dataset loading script that downloads data from external sources. Is there a simpler example of a dataset with subsets built from plain Datasets, without downloading data from external sources and so on?

It seems that the only way to add subsets is to create a dataset loading script. I’ve created one.

Here is a fragment; maybe it will be useful for someone:

import datasets
import pyarrow.parquet as pq

_CITATION = ''

_DESCRIPTION = ''

_HOMEPAGE = ''

_LICENSE = ''

_BASE_URL = 'https://huggingface.co/datasets/AresEkb/prof_standards_sbert_large_mt_nlu_ru/resolve/main/'

_FEATURES = {
    'domains': datasets.Features({
        'reg_number': datasets.Value('string'),
        'standard_name': datasets.Value('string'),
        'name': datasets.Value('string'),
        'purpose': datasets.Value('string'),
        'embeddings': datasets.Sequence(datasets.Value('float32')),
    }),
    'generalized_functions': datasets.Features({
        'generalized_function_id': datasets.Value('string'),
        'reg_number': datasets.Value('string'),
        'name': datasets.Value('string'),
        'embeddings': datasets.Sequence(datasets.Value('float32')),
    }),
}

class ProfStandardsDatasetBuilder(datasets.ArrowBasedBuilder):

    VERSION = datasets.Version('0.0.1')

    # One BuilderConfig per subset; the config name selects which Parquet file to load
    BUILDER_CONFIGS = [
        datasets.BuilderConfig('domains', VERSION),
        datasets.BuilderConfig('generalized_functions', VERSION),
    ]

    DEFAULT_CONFIG_NAME = 'domains'

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=_FEATURES[self.config.name],
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        # Download the Parquet file that corresponds to the selected config
        url = _BASE_URL + self.config.name + '.parquet'
        file_path = dl_manager.download_and_extract(url)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={'file_path': file_path},
            ),
        ]

    def _generate_tables(self, file_path):
        # Yield the whole Parquet file as a single Arrow table
        yield 0, pq.read_table(file_path)

Just store your datasets in parquet format:

domains_dataset.to_parquet('domains.parquet')

And upload it manually to your repository.
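If you don’t want to use the web UI for the upload, a minimal sketch with huggingface_hub should also work (the repo id and file names below are just the ones from this thread):

from huggingface_hub import HfApi

api = HfApi()
for name in ['domains', 'generalized_functions']:
    # put each Parquet file next to the loading script in the dataset repo
    api.upload_file(
        path_or_fileobj=f'{name}.parquet',
        path_in_repo=f'{name}.parquet',
        repo_id='AresEkb/prof_standards_sbert_large_mt_nlu_ru',
        repo_type='dataset',
    )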

It was not very hard, but I got lost in the documentation :slight_smile:

Some things to note:

  1. The dataset loading script should have the same name as the repository, for example prof_standards_sbert_large_mt_nlu_ru.py

  2. It’s better to use underscores in dataset names. There are dash-named repositories on the Hub and they seem to work, but I had problems with dashes.

  3. You can use the following command to test your script locally:

datasets-cli test prof_standards_sbert_large_mt_nlu_ru.py --save_infos --all_configs

The same command also generates dataset_info for README.md. However, I don’t know whether dataset_info is required.

  4. You can use the following command to test your repository (a loading sketch follows this list):

datasets-cli test 'AresEkb/prof_standards_sbert_large_mt_nlu_ru' --save_infos --all_configs

  5. The dataset card isn’t updated immediately.
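To double-check that the subsets work, each one can be loaded by its config name (a minimal sketch, assuming the repository from this thread and its two configs):

from datasets import load_dataset

# each subset is a separate config of the same repository
domains = load_dataset('AresEkb/prof_standards_sbert_large_mt_nlu_ru', 'domains', split='train')
functions = load_dataset('AresEkb/prof_standards_sbert_large_mt_nlu_ru', 'generalized_functions', split='train')
print(domains.features)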

Thanks for the notes!

We’re also working on a way to define which files go into which dataset configuration using simple YAML here: Support for multiple configs in packaged modules via metadata yaml info by polinaeterna · Pull Request #5331 · huggingface/datasets · GitHub
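The idea is that the dataset card’s YAML metadata maps each config to its data files, roughly like this (just a sketch using this thread’s file names, the exact syntax is still being discussed in the PR):

configs:
- config_name: domains
  data_files: domains.parquet
- config_name: generalized_functions
  data_files: generalized_functions.parquet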
