Save `DatasetDict` to HuggingFace Hub

Hi there,

I prepared my data as a DatasetDict object and saved it to disk with the save_to_disk method. I’d like to upload the generated folder to the HuggingFace Hub and load it with the usual load_dataset function, but I have not yet found a way to do so. Is this possible?

Thanks a lot in advance for your help.

Best,
Pietro

Hi,

this week’s release of datasets will add support for directly pushing a Dataset/DatasetDict object to the Hub. In the meantime, you can call a to_{format} method, where format is one of ["csv", "json", "txt", "parquet"], on each split of the DatasetDict object and push the generated files to the Hub (follow the docs here for more information). Also note that this requires the master version of the library, which you can install with:

pip install git+https://github.com/huggingface/datasets.git

Without the master version, you’ll have to specify a list of files to load each split separately (docs on that are here).


Hi @mariosasko,

Thanks a lot for your answer! I will try this out later and let you know how it goes. Excited about the new upcoming feature :slight_smile:

Best,
Pietro

Hi @mariosasko,

I just followed the guide Upload from Python to push a DatasetDict with train and validation Datasets inside to the datasets hub.

raw_datasets = DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 10000000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1000000
    })
})

from huggingface_hub import notebook_login
notebook_login()

raw_datasets.push_to_hub(repo_id=dataset_name, private=True)

The DatasetDict.push_to_hub() call works, and I have train and validation parquet files in my repository (in the data folder), but when I call load_dataset(), I get a DatasetDict with only a train Dataset that contains all the rows (11000000) from the original train Dataset (10000000) and validation Dataset (1000000) that were pushed.

import datasets
from datasets import load_dataset
raw_datasets = load_dataset(dataset_name, use_auth_token=True)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 11000000
    })
})

Strange. How can I get my original DatasetDict with load_dataset()? Thanks.

@mariosasko, I guess the concatenation of train and validation when using load_dataset() is normal behavior when no dataset loading script is present in the files of the HF dataset repository (check this post):

However, this information is not given in the HF doc Upload from Python about how to upload a datasets.DatasetDict to the Hugging Face Hub in Python.

Should the HF doc be updated, or should DatasetDict.push_to_hub() be modified?
cc @lhoestq

Hi! Since push_to_hub was introduced, the dataset builder no longer concatenates everything. It now takes the file names into account: each file whose name contains “validation” goes to the validation split, for example. The supported patterns are explained in the docs

Make sure to update datasets so that each split is loaded independently :wink:
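The idea behind this filename-based routing can be sketched as follows. This is an illustrative simplification, not the library’s actual implementation, and the keyword lists and default behavior here are assumptions for the sake of the example:

```python
# Illustrative sketch: route a data file to a split based on a split
# keyword appearing in its name, delimited by common separators.
import re

SPLIT_KEYWORDS = {
    "train": ["train"],
    "validation": ["validation", "valid", "dev"],
    "test": ["test"],
}

def infer_split(filename: str) -> str:
    """Return the split a file name maps to (default: train)."""
    for split, keywords in SPLIT_KEYWORDS.items():
        for kw in keywords:
            # keyword must be delimited by ., _, space, /, digits or -
            if re.search(rf"(^|[._ /0-9-]){kw}([._ /0-9-]|$)", filename):
                return split
    return "train"  # fallback when no keyword matches

print(infer_split("my-validation.parquet"))      # validation
print(infer_split("data/train-00000.parquet"))   # train
```

With files named this way, a repository with train-*.parquet and validation-*.parquet files reloads as a DatasetDict with separate train and validation splits.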


Hi @lhoestq,

You are right.

I was using the HF SageMaker notebooks, where the datasets version is older than 1.16 (often 1.13, cc @philschmid), the version in which the DatasetDict.push_to_hub() method was introduced.

With this version (and the latest one, 1.16.1), it works (I get the same DatasetDict with train and validation splits as the one I pushed to the HF datasets hub):

# !pip install datasets --upgrade
import datasets

print(datasets.__version__)
# 1.16.1

from datasets import load_dataset
raw_datasets = load_dataset(dataset_name, use_auth_token=True)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 10000000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1000000
    })
})

Hi @lhoestq,

As already discussed, we need datasets>=1.16 in order to use push_to_hub() and load_dataset() with a DatasetDict. This is very clear.

However, I checked whether the features of the DatasetDict are kept, and it appears they are not.

# download https://huggingface.co/datasets/lener_br
datasets = load_dataset('lener_br')

# check the features
datasets['train'].features

{'id': Value(dtype='string', id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=13, names=['O', 'B-ORGANIZACAO', 'I-ORGANIZACAO', 'B-PESSOA', 'I-PESSOA', 'B-TEMPO', 'I-TEMPO', 'B-LOCAL', 'I-LOCAL', 'B-LEGISLACAO', 'I-LEGISLACAO', 'B-JURISPRUDENCIA', 'I-JURISPRUDENCIA'], names_file=None, id=None), length=-1, id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

# connect to HF hub
from huggingface_hub import notebook_login
notebook_login()

# push this DatasetDict() to my HF profile as private
datasets.push_to_hub(repo_id='test_lener_br', private=True)

# download the pushed DatasetDict()
datasets = load_dataset('pierreguillou/test_lener_br', use_auth_token=API_TOKEN)

# check the features
datasets['train'].features

{'id': Value(dtype='string', id=None),
 'ner_tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

As you can see, I lost the ner_tags ClassLabel feature of the original DatasetDict. What do you think?

It looks like a bug, thanks for reporting!

I think others have had issues with ClassLabel recently (see here). We’re investigating what’s going on :slight_smile:
