How to prepare a local dataset for load_dataset() and mimic its behavior when loading one of HF's existing online datasets

Good day! Thank you very much for reading this question.

I am working on a private dataset in local storage, and I want to mimic a program that loads its dataset with load_dataset(). In order not to modify the training loop, I would like to convert my private dataset into the exact format in which the online dataset is stored, so that after loading it behaves exactly the same, i.e. it gives a DatasetDict object with 3 splits (train, validation and test) and a feature 'translation' that contains two key-value pairs per row, with the language code as the key and the sentence as the value. The expected behavior is shown below.

Would you please help me with
(1) the folder structure, the naming of the files, and the data format, and
(2) how to call load_dataset so that it returns a DatasetDict with the same behavior as below?

from datasets import load_dataset

raw_datasets = load_dataset("wmt17", "de-en")
print(raw_datasets)
'''
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 5906184
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2999
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3004
    })
})
'''
print(raw_datasets["train"][0])
'''
{'translation': {'de': 'Wiederaufnahme der Sitzungsperiode',
  'en': 'Resumption of the session'}}
'''

Thank you! :hugs:

hey @jenniferL, to have the same behaviour as your example you’ll need to create a dataset loading script (see docs) which defines the configuration (de-en in your example), along with the column names and types. once your script is ready, you should be able to do something like:

from datasets import load_dataset

dataset = load_dataset(
    'PATH/TO/MY/SCRIPT.py',
    'my_configuration',
    data_files={'train': 'my_train_file.txt', 'validation': 'my_validation_file.txt'},
)

tips:

  • you might need to hardcode data_files explicitly in your script to preserve the exact same signature you have for load_dataset in your example.
  • you might find this script template a useful place to start from
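
to make the tips above concrete, here's a minimal sketch of what such a loading script could look like. everything in it (the builder class name, the de-en config, the tab-separated file format, and reading data_files from the config) is an assumption for illustration rather than the actual wmt17 script, so adapt it to your own file layout:

# my_translation_script.py -- a minimal sketch, not the official wmt17 loading script
import datasets

class MyTranslationDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="de-en", version=datasets.Version("1.0.0")),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"translation": datasets.Translation(languages=["de", "en"])}
            ),
        )

    def _split_generators(self, dl_manager):
        # data_files comes from load_dataset(..., data_files=...); it can also be hardcoded here
        data_files = self.config.data_files
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN,
                                    gen_kwargs={"filepath": data_files["train"]}),
            datasets.SplitGenerator(name=datasets.Split.VALIDATION,
                                    gen_kwargs={"filepath": data_files["validation"]}),
            datasets.SplitGenerator(name=datasets.Split.TEST,
                                    gen_kwargs={"filepath": data_files["test"]}),
        ]

    def _generate_examples(self, filepath):
        # depending on the datasets version, each data_files entry may be a single path or a list of paths
        paths = filepath if isinstance(filepath, (list, tuple)) else [filepath]
        idx = 0
        for path in paths:
            # assumes one "<de sentence>\t<en sentence>" pair per line
            with open(path, encoding="utf-8") as f:
                for line in f:
                    de, en = line.rstrip("\n").split("\t")
                    yield idx, {"translation": {"de": de, "en": en}}
                    idx += 1

once you also pass a test file in data_files (or hardcode the three paths), load_dataset('PATH/TO/MY/SCRIPT.py', 'de-en', data_files=...) should return a DatasetDict with the same train/validation/test structure as the wmt17 example above.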

Hello @lewtun, thank you very much for pointing me in the right direction. :hugs:

I will try it out!


Hi @lewtun ,

Is it possible to skip the load_dataset() step and just convert a list of dicts in Python that we have created on our own to <class 'datasets.arrow_dataset.Dataset'>?

For example, I loaded the imdb dataset using
raw_datasets = load_dataset("imdb")

Then type(raw_datasets['train']) is a <class 'datasets.arrow_dataset.Dataset'>, which behaves like a list of dictionaries:
print(raw_datasets['train'][0]) gives -
{'text': "Bromwell High is a cartoon comedy. … A classic line … What a pity that it isn't!", 'label': 1}

So if we have created lists of such dicts in Python for the train and test cases, is there a way to just convert these lists of dicts to <class 'datasets.arrow_dataset.Dataset'>?

Thanks,
-V

Hi @vikasy95, yes, you can create a Dataset object using the from_dict() method, e.g.

from datasets import Dataset

data = {"text":["This is a positive sentence", "This is a negative sentence"], "label": [1,0]}
dset = Dataset.from_dict(data)

See the docs for more details :slight_smile:
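
if you start from a list of dicts (one dict per row, like raw_datasets['train'][0] above) rather than a dict of lists, you can transpose it yourself before calling from_dict(); recent versions of datasets also provide Dataset.from_list() for exactly this case. a small sketch with made-up records (the record contents and the helper name are just for illustration):

from datasets import Dataset, DatasetDict

# hypothetical rows, shaped like the imdb examples in the question
train_records = [{"text": "This is a positive sentence", "label": 1},
                 {"text": "This is a negative sentence", "label": 0}]
test_records = [{"text": "Yet another sentence", "label": 1}]

def records_to_dataset(records):
    # transpose the list of dicts into a dict of lists, then build the Dataset
    columns = {key: [row[key] for row in records] for key in records[0]}
    return Dataset.from_dict(columns)

raw_datasets = DatasetDict({
    "train": records_to_dataset(train_records),
    "test": records_to_dataset(test_records),
})
print(raw_datasets["train"][0])
# {'text': 'This is a positive sentence', 'label': 1}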


Thanks a lot @lewtun, it works perfectly.

Thanking you,
-V
