Save `DatasetDict` to HuggingFace Hub

Hi there,

I prepared my data as a DatasetDict object and saved it to disk with the save_to_disk method. I’d like to upload the generated folder to the HuggingFace Hub and load it with the usual load_dataset function, but I have not yet found a way to do so. Is this possible?

Thanks a lot in advance for your help.

Best,
Pietro

Hi,

this week’s release of datasets will add support for directly pushing a Dataset/DatasetDict object to the Hub. In the meantime, you can call a to_{format} method, where format is one of ["csv", "json", "txt", "parquet"], on each split of the DatasetDict object and push the generated files to the Hub (follow the docs here for more information). Also note that this requires the master version of the library, which you can install with:

pip install git+https://github.com/huggingface/datasets.git

Without the master version, you’ll have to specify a list of files to load each split separately (docs on that are here).


Hi @mariosasko,

Thanks a lot for your answer! I will try this out later and let you know how it goes. Excited about the new upcoming feature :slight_smile:

Best,
Pietro

Hi @mariosasko,

I just followed the guide Upload from Python to push a DatasetDict with train and validation Datasets inside to the datasets hub.

raw_datasets = DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 10000000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1000000
    })
})

from huggingface_hub import notebook_login
notebook_login()

raw_datasets.push_to_hub(repo_id=dataset_name, private=True)

The DatasetDict.push_to_hub() call works, and I have train and validation parquet files in my repository (in the data folder), but when I call load_dataset(), I get a DatasetDict with only a train Dataset that contains all the rows (11000000) from the original train Dataset (10000000) and validation Dataset (1000000) that were pushed.

import datasets
from datasets import load_dataset
raw_datasets = load_dataset(dataset_name, use_auth_token=True)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 11000000
    })
})

Strange. How can I get my original DatasetDict with load_dataset()? Thanks.

@mariosasko, I guess the concatenation of train and validation when using load_dataset() is normal behavior when no dataset loading script is present in the files of the HF dataset repository (check this post):

However, this information is not given in the HF doc Upload from Python about how to upload a datasets.DatasetDict to the Hugging Face Hub in Python.

Should the HF doc be updated, or should DatasetDict.push_to_hub() be modified?
cc @lhoestq

Hi! Since push_to_hub was introduced, the dataset builder no longer concatenates everything. It now takes the file names into account: each file whose name contains “validation” goes to the validation split, for example. The supported patterns are explained in the docs

Make sure to update datasets so that each split is loaded independently :wink:
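The idea behind this filename-based routing can be sketched as follows. This is an illustrative simplification, not the library’s actual implementation, and the keyword lists and default behavior here are assumptions for the sake of the example:

```python
# Illustrative sketch: route a data file to a split based on a split
# keyword appearing in its name, delimited by common separators.
import re

SPLIT_KEYWORDS = {
    "train": ["train"],
    "validation": ["validation", "valid", "dev"],
    "test": ["test"],
}

def infer_split(filename: str) -> str:
    """Return the split a file name maps to (default: train)."""
    for split, keywords in SPLIT_KEYWORDS.items():
        for kw in keywords:
            # keyword must be delimited by ., _, space, /, digits or -
            if re.search(rf"(^|[._ /0-9-]){kw}([._ /0-9-]|$)", filename):
                return split
    return "train"  # fallback when no keyword matches

print(infer_split("my-validation.parquet"))      # validation
print(infer_split("data/train-00000.parquet"))   # train
```

With files named this way, a repository with train-*.parquet and validation-*.parquet files reloads as a DatasetDict with separate train and validation splits.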


Hi @lhoestq,

You are right.

I was using the HF SageMaker notebooks, where the datasets version is older than 1.16 (often 1.13, cc @philschmid), the version in which the DatasetDict.push_to_hub() method was introduced.

With this version (and the latest one, 1.16.1), it works (I get the same DatasetDict with train and validation splits as the one I pushed to the HF datasets hub):

# !pip install datasets --upgrade
import datasets

print(datasets.__version__)
# 1.16.1

from datasets import load_dataset
raw_datasets = load_dataset(dataset_name, use_auth_token=True)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 10000000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1000000
    })
})

Hi @lhoestq,

As already discussed, we need datasets>=1.16 in order to use push_to_hub() and load_dataset() with a DatasetDict. This is very clear.

However, I checked whether the features of the DatasetDict are kept, and it appears they are not.

# download https://huggingface.co/datasets/lener_br
datasets = load_dataset('lener_br')

# check the features
datasets['train'].features

{'id': Value(dtype='string', id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=13, names=['O', 'B-ORGANIZACAO', 'I-ORGANIZACAO', 'B-PESSOA', 'I-PESSOA', 'B-TEMPO', 'I-TEMPO', 'B-LOCAL', 'I-LOCAL', 'B-LEGISLACAO', 'I-LEGISLACAO', 'B-JURISPRUDENCIA', 'I-JURISPRUDENCIA'], names_file=None, id=None), length=-1, id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

# connect to HF hub
from huggingface_hub import notebook_login
notebook_login()

# push this DatasetDict() to my HF profile as private
datasets.push_to_hub(repo_id='test_lener_br', private=True)

# download the pushed DatasetDict()
datasets = load_dataset('pierreguillou/test_lener_br', use_auth_token=API_TOKEN)

# check the features
datasets['train'].features

{'id': Value(dtype='string', id=None),
 'ner_tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

As you can see, I lost the ner_tags ClassLabel feature of the original DatasetDict. What do you think?

It looks like a bug, thanks for reporting!

I think others have had issues with ClassLabel recently (see here). We’re investigating what’s going on :slight_smile:
