How to prepare a local dataset for load_dataset() and mimic its behavior when loading one of HF's existing online datasets

Good day! Thank you very much for reading this question.

I am working on a private dataset in local storage, and I want to mimic a program that loads its dataset with load_dataset(). In order not to modify the training loop, I would like to convert my private dataset into the exact format in which the online dataset is stored, so that after loading it behaves exactly the same, i.e. it gives a DatasetDict object with 3 splits (train, validation and test) and a feature 'translation' that contains two key-value pairs per row, with the language code as the key and the sentence as the value. The expected behavior is shown below.

Would you please help me with
(1) the folder structure, the naming of the files, and the data format, and
(2) how to call load_dataset so that it returns a DatasetDict with the same behavior as below?

from datasets import load_dataset

raw_datasets = load_dataset("wmt17", "de-en")
print(raw_datasets)
'''
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 5906184
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2999
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3004
    })
})
'''
print(raw_datasets["train"][0])
'''
{'translation': {'de': 'Wiederaufnahme der Sitzungsperiode',
  'en': 'Resumption of the session'}}
'''

Thank you! :hugs:

hey @jenniferL, to have the same behaviour as your example you’ll need to create a dataset loading script (see docs) which defines the configuration (de-en in your example), along with the column names and types. once your script is ready, you should be able to do something like:

from datasets import load_dataset

dataset = load_dataset(
    'PATH/TO/MY/SCRIPT.py',
    'my_configuration',
    data_files={'train': 'my_train_file.txt', 'validation': 'my_validation_file.txt'},
)

tips:

  • you might need to hardcode data_files explicitly in your script to preserve the exact same signature you have for load_dataset in your example.
  • you might find this script template a useful place to start from
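
to make the tips above concrete, here's a minimal sketch of what such a loading script could look like. everything in it (the builder class name, the de-en config, the tab-separated file format, and reading data_files from the config) is an assumption for illustration rather than the actual wmt17 script, so adapt it to your own file layout:

# my_translation_script.py -- a minimal sketch, not the official wmt17 loading script
import datasets

class MyTranslationDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="de-en", version=datasets.Version("1.0.0")),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"translation": datasets.Translation(languages=["de", "en"])}
            ),
        )

    def _split_generators(self, dl_manager):
        # data_files comes from load_dataset(..., data_files=...); it can also be hardcoded here
        data_files = self.config.data_files
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN,
                                    gen_kwargs={"filepath": data_files["train"]}),
            datasets.SplitGenerator(name=datasets.Split.VALIDATION,
                                    gen_kwargs={"filepath": data_files["validation"]}),
            datasets.SplitGenerator(name=datasets.Split.TEST,
                                    gen_kwargs={"filepath": data_files["test"]}),
        ]

    def _generate_examples(self, filepath):
        # depending on the datasets version, each data_files entry may be a single path or a list of paths
        paths = filepath if isinstance(filepath, (list, tuple)) else [filepath]
        idx = 0
        for path in paths:
            # assumes one "<de sentence>\t<en sentence>" pair per line
            with open(path, encoding="utf-8") as f:
                for line in f:
                    de, en = line.rstrip("\n").split("\t")
                    yield idx, {"translation": {"de": de, "en": en}}
                    idx += 1

once you also pass a test file in data_files (or hardcode the three paths), load_dataset('PATH/TO/MY/SCRIPT.py', 'de-en', data_files=...) should return a DatasetDict with the same train/validation/test structure as the wmt17 example above.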

Hello @lewtun, thank you very much for pointing me in the right direction. :hugs:

I will try it out!


Hi @lewtun ,

Is it possible to skip the load_dataset() step and just convert a list of dicts in Python that we have created on our own to <class 'datasets.arrow_dataset.Dataset'>?

For example, I loaded the imdb dataset using
raw_datasets = load_dataset("imdb")

Then type(raw_datasets['train']) is a <class 'datasets.arrow_dataset.Dataset'>, which behaves like a list of dictionaries:
print(raw_datasets['train'][0]) gives -
{'text': "Bromwell High is a cartoon comedy. … A classic line … What a pity that it isn't!", 'label': 1}

So if we have created lists of such dicts in Python for the train and test cases, is there a way to just convert these lists of dicts to <class 'datasets.arrow_dataset.Dataset'>?

Thanks,
-V

Hi @vikasy95, yes, you can create a Dataset object using the from_dict() method, e.g.

from datasets import Dataset

data = {"text":["This is a positive sentence", "This is a negative sentence"], "label": [1,0]}
dset = Dataset.from_dict(data)

See the docs for more details :slight_smile:
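
if you start from a list of dicts (one dict per row, like raw_datasets['train'][0] above) rather than a dict of lists, you can transpose it yourself before calling from_dict(); recent versions of datasets also provide Dataset.from_list() for exactly this case. a small sketch with made-up records (the record contents and the helper name are just for illustration):

from datasets import Dataset, DatasetDict

# hypothetical rows, shaped like the imdb examples in the question
train_records = [{"text": "This is a positive sentence", "label": 1},
                 {"text": "This is a negative sentence", "label": 0}]
test_records = [{"text": "Yet another sentence", "label": 1}]

def records_to_dataset(records):
    # transpose the list of dicts into a dict of lists, then build the Dataset
    columns = {key: [row[key] for row in records] for key in records[0]}
    return Dataset.from_dict(columns)

raw_datasets = DatasetDict({
    "train": records_to_dataset(train_records),
    "test": records_to_dataset(test_records),
})
print(raw_datasets["train"][0])
# {'text': 'This is a positive sentence', 'label': 1}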


Thanks a lot @lewtun, it works perfectly.

Thanking you,
-V
