Creating a dataset with custom data

Hey there, I’m trying to create a DatasetDict with two datasets (train and dev) for fine-tuning a BART model.

I’ve created lists of source sentences, target sentences, and IDs; they are all lists of strings.

data = DatasetDict({
    "train": Dataset.from_dict({
        "id": train_idxs,
        "translation": {
            "source": train_inputs,
            "target": train_labels
        }
    }, features=Features({"id": Value(dtype='string'), "translation": {"source": Sequence, "target": Sequence}})),
    "dev": {
        "id": dev_idxs,
        "translation": {
            "source": dev_inputs,
            "target": dev_labels
        }
    }
})

This is the code I’m using to create the DatasetDict, but I get the error:

TypeError: string indices must be integers

I want the object to have the same structure as the “Books” DatasetDict that is used in the Translation guide.

If anyone has any suggestions, please let me know, and tell me if I need to provide more information!

Thank you!

Hello! I don’t think you need to invoke DatasetDict manually. Instead, since you’re loading text data, you can use one of the methods listed here: Load

If you have a bunch of CSV files with the headers source and target, you can load them with

from datasets import load_dataset
data = load_dataset("csv", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "validate": "my_test_3.txt"})
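
Note that loading CSV files this way gives flat source and target columns, not the nested translation field from the question. Here is a rough sketch of one way to bridge that gap, assuming the files really do have source and target headers. The helper names (write_pairs_csv, to_translation, load_pairs) and the file paths are made up for illustration; only load_dataset and map come from the datasets library.

```python
import csv


def write_pairs_csv(path, pairs):
    """Write (source, target) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target"])
        writer.writerows(pairs)


def to_translation(example):
    """Fold the flat source/target columns into one nested "translation" field."""
    return {
        "translation": {
            "source": example["source"],
            "target": example["target"],
        }
    }


def load_pairs(train_path, dev_path):
    """Load two CSV files as splits and nest their columns (hypothetical helper)."""
    # Imported here so the helpers above can be used without the library installed.
    from datasets import load_dataset

    data = load_dataset("csv", data_files={"train": train_path, "dev": dev_path})
    # Apply the nesting to every split and drop the original flat columns.
    return data.map(to_translation, remove_columns=["source", "target"])
```

Calling load_pairs("train.csv", "dev.csv") should then return a DatasetDict with train and dev splits, each with a single translation column. An id column would have to be in the CSV files already or be added in the same map call.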

In general, I think it’s best to avoid building a DatasetDict by hand and instead rely on the helper methods as much as possible.

Hope that helps!

Hi @edensn! You can use the following features dictionary instead of the current one to match the features:

Features({"id": Value(dtype='string'), "translation": Translation(languages=["source", "target"])})
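
For completeness, here is a minimal sketch of how that features dictionary could be plugged into the original snippet. It assumes the three lists per split have equal length, and it passes the translation column as a list of per-example dicts instead of a dict of lists. The helper names to_columns and build_dataset_dict are hypothetical, and both splits are wrapped in Dataset.from_dict, which is not the case in the question's code.

```python
def to_columns(ids, sources, targets):
    """Turn three parallel lists into the column mapping for Dataset.from_dict."""
    if not (len(ids) == len(sources) == len(targets)):
        raise ValueError("ids, sources and targets must have the same length")
    return {
        "id": list(ids),
        # One dict per example, matching Translation(languages=["source", "target"]).
        "translation": [
            {"source": s, "target": t} for s, t in zip(sources, targets)
        ],
    }


def build_dataset_dict(train, dev):
    """Build a DatasetDict from two (ids, sources, targets) tuples (hypothetical helper)."""
    # Imported here so to_columns can be used without the library installed.
    from datasets import Dataset, DatasetDict, Features, Translation, Value

    features = Features({
        "id": Value(dtype="string"),
        "translation": Translation(languages=["source", "target"]),
    })
    return DatasetDict({
        "train": Dataset.from_dict(to_columns(*train), features=features),
        "dev": Dataset.from_dict(to_columns(*dev), features=features),
    })
```

With the lists from the question, the call would look like build_dataset_dict((train_idxs, train_inputs, train_labels), (dev_idxs, dev_inputs, dev_labels)).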

Thank you both for the suggestions! I actually figured it out using the method that @mariosasko suggested, by opening the Books dataset I was trying to mimic and looking at books.features!