How to upload multiple related datasets?


I have several related datasets. For example: 1) users, 2) products, 3) product views made by users.

Should I create a separate repository for each dataset? Or is it possible to upload them all into a single repository?

Pushing the datasets into the same repository one after another just overwrites them: the user_products push replaces both users and products.


Can I specify a separate subpath or name for each dataset?

The super_glue repository on the Hugging Face Hub contains subsets. I guess that's what I need. But it's based on a dataset loading script that downloads data from external sources. Is there a simpler example of a dataset with subsets, based on plain Datasets, without downloading data from external sources and so on?

It seems that the only way to add subsets is to create a dataset loading script. I’ve created one.

Here is the script; maybe it will be useful for someone:

import datasets
import pyarrow.parquet as pq

# Base URL of the data files; left empty here, point it at your repository
_BASE_URL = ''

# One Features schema per subset (config)
_FEATURES = {
    'domains': datasets.Features({
        'reg_number': datasets.Value('string'),
        'standard_name': datasets.Value('string'),
        'name': datasets.Value('string'),
        'purpose': datasets.Value('string'),
        'embeddings': datasets.Sequence(datasets.Value('float32')),
    }),
    'generalized_functions': datasets.Features({
        'generalized_function_id': datasets.Value('string'),
        'reg_number': datasets.Value('string'),
        'name': datasets.Value('string'),
        'embeddings': datasets.Sequence(datasets.Value('float32')),
    }),
}


class ProfStandardsDatasetBuilder(datasets.ArrowBasedBuilder):

    VERSION = datasets.Version('0.0.1')

    # One BuilderConfig per subset
    BUILDER_CONFIGS = [
        datasets.BuilderConfig('domains', VERSION),
        datasets.BuilderConfig('generalized_functions', VERSION),
    ]

    DEFAULT_CONFIG_NAME = 'domains'

    def _info(self):
        return datasets.DatasetInfo(features=_FEATURES[self.config.name])

    def _split_generators(self, dl_manager):
        # Each subset is stored in a parquet file named after its config
        url = _BASE_URL + self.config.name + '.parquet'
        file_path = dl_manager.download_and_extract(url)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={'file_path': file_path},
            ),
        ]

    def _generate_tables(self, file_path):
        yield 0, pq.read_table(file_path)

Just store your datasets in parquet format:


And upload it manually to your repository.

It was not very hard, but I got lost in the documentation :slight_smile:

Some things to note:

  1. The dataset loading script should have the same name as the repository. For example, a repository named prof_standards_sbert_large_mt_nlu_ru should contain prof_standards_sbert_large_mt_nlu_ru.py.

  2. It's better to use underscores in dataset names. There are dash-named repositories on the Hub, and they seem to work, but I had problems with dashes.

  3. You can use the following command to test your script locally:

datasets-cli test path/to/your_script.py --save_infos --all_configs

The --save_infos flag also generates the dataset info. However, I don't know whether dataset_info is required.

  4. You can use the following command to test your repository:

datasets-cli test 'AresEkb/prof_standards_sbert_large_mt_nlu_ru' --save_infos --all_configs

  5. The dataset card isn't updated immediately.

Thanks for the notes!

We’re also working on a way to define which files go into which dataset configuration using simple YAML here: Support for multiple configs in packaged modules via metadata yaml info by polinaeterna · Pull Request #5331 · huggingface/datasets · GitHub
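As a sketch of what that could look like for the dataset above (the exact YAML schema was still under discussion in that PR, so the field names here are illustrative):

```yaml
configs:
- config_name: domains
  data_files: domains.parquet
- config_name: generalized_functions
  data_files: generalized_functions.parquet
```

This would go into the YAML metadata block at the top of the repository's README.md, removing the need for a loading script.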
