Good day! Thank you very much for reading this question.
I am working on a private dataset in local storage, and I want to mimic a program that loads its dataset with load_dataset(). To avoid modifying the training loop, I would like to convert my private dataset into the exact format in which the online dataset is stored, so that it behaves identically after loading, i.e. it is a DatasetDict object with 3 splits (train, validation and test) and a feature 'translation', which contains two key-value pairs in each row, with the language code as the key and the sentence as the value. The behavior is shown as below.
Would you please help me with
(1) the folder structure, naming of the files, and the data format
(2) how to call load_dataset() so that it returns a DatasetDict with the same behavior as below.
hey @jenniferL, to have the same behaviour as your example you’ll need to create a dataset loading script (see docs) which defines the configuration (de-en in your example), along with the column names and types. once your script is ready, you should be able to do something like:
Is it possible to skip the load_dataset() step and just convert a list of dicts that we have created ourselves in Python to <class 'datasets.arrow_dataset.Dataset'>?
For example, I loaded the imdb dataset using
raw_datasets = load_dataset("imdb")
Then type(raw_datasets['train']) is <class 'datasets.arrow_dataset.Dataset'>, which upon printing is a list of dictionaries like:
print(raw_datasets['train'][0]) gives -
{'text': "Bromwell High is a cartoon comedy. h. A classic line…What a pity that it isn't!", 'label': 1}
So if we have created lists of such dicts in Python, one for the train cases and one for the test cases, is there a way to convert these lists of dicts to <class 'datasets.arrow_dataset.Dataset'>?
Hi @vikasy95, yes, you can create a Dataset object with the from_dict() method, which takes a dict mapping each column name to a list of values, e.g.
from datasets import Dataset
data = {"text":["This is a positive sentence", "This is a negative sentence"], "label": [1,0]}
dset = Dataset.from_dict(data)
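Since the question starts from a list of row dicts rather than a dict of columns, here is a minimal sketch of one way to bridge the two. The helper names rows_to_columns and build_dataset_dict and the sample rows are hypothetical, made up for illustration. The sketch assumes every row has the same keys as the first one.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists.

    Assumes every row has the same keys as the first row;
    a row with a missing key raises KeyError.
    """
    if not rows:
        return {}
    keys = list(rows[0].keys())
    return {key: [row[key] for row in rows] for key in keys}


def build_dataset_dict(splits):
    """Build a DatasetDict from a mapping of split name to list of row dicts."""
    # Imported here so rows_to_columns stays usable without the library installed.
    from datasets import Dataset, DatasetDict

    return DatasetDict(
        {name: Dataset.from_dict(rows_to_columns(rows)) for name, rows in splits.items()}
    )


# Hypothetical sample rows in the same shape as the imdb example.
train_rows = [
    {"text": "This is a positive sentence", "label": 1},
    {"text": "This is a negative sentence", "label": 0},
]
test_rows = [{"text": "Another positive sentence", "label": 1}]

train_columns = rows_to_columns(train_rows)
```

With the datasets library installed, build_dataset_dict({"train": train_rows, "test": test_rows}) should give a DatasetDict with one Dataset per split. Recent releases of datasets also provide Dataset.from_list(), which accepts the list of dicts directly, so the column conversion may not be needed there.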