Creating a dataset with custom data

edensn · September 2, 2022, 4:26pm

Hey there, I’m trying to create a DatasetDict with two datasets (train and dev) for fine-tuning a BART model.

I’ve created lists of source sentences, target sentences and IDs; they are all lists of strings.

data = DatasetDict({
    "train": Dataset.from_dict({
        "id": train_idxs,
        "translation": {
            "source": train_inputs,
            "target": train_labels
        }
    }, features=Features({"id": Value(dtype='string'), "translation": {"source": Sequence, "target": Sequence}})),
    "dev": {
        "id": dev_idxs,
        "translation": {
            "source": dev_inputs,
            "target": dev_labels
        }
    }
})

This is the code I’m using to create the DatasetDict, but I get this error:

TypeError: string indices must be integers

I want the object to have the same structure as the “Books” DatasetDict that is used in this guide: Translation

If anyone has any suggestions, please let me know, and also tell me if I need to provide more information!

Thank you!

NimaBoscarino · September 2, 2022, 11:48pm

Hello! I don’t think that you should need to manually invoke DatasetDict. Instead, since you’re loading text data you can use one of the methods listed here: Load

If you have a bunch of CSV files with the headers source and target, you can load them with

from datasets import load_dataset
# The file names are placeholders; "train" and "validate" become the split names.
data = load_dataset("csv", data_files={"train": ["my_text_1.csv", "my_text_2.csv"], "validate": "my_text_3.csv"})

In general, I think it’s best to avoid DatasetDict and instead rely on the helper methods as much as possible.

Hope that helps!

mariosasko · September 5, 2022, 10:51am

Hi @edensn! You can use the following features dictionary instead of the current one to match the features:

Features({"id": Value(dtype='string'), "translation": Translation(languages=["source", "target"])})

edensn · September 5, 2022, 1:57pm

Thank you both for the suggestions! I actually arrived at the method that @mariosasko suggested by opening the Books dataset I was trying to mimic and just looking at books.features!
