How to Use a Nested Python Dictionary in Dataset.from_dict

GSA · April 26, 2021, 12:21am

I have a nested Python dictionary:

# create a nested dictionary
Dict = {'train': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'tags': train_tags},
        'val': {'id': np.arange(len(val_texts)),
                'tokens': val_texts,
                'tags': val_tags},
        'test': {'id': np.arange(len(test_texts)),
                 'tokens': test_texts,
                 'tags': test_tags}
        }

My question is: how do I use this nested dictionary with Dataset.from_dict() from the datasets library so that I get output like the following?

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 6801
    })
    val: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 1480
    })
    test: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 1532
    })
})

lewtun · April 26, 2021, 10:59am

hey @GSA, as far as i know you can’t create a DatasetDict object directly from a python dict, but you could try creating 3 Dataset objects (one for each split) and then add them to DatasetDict as follows:

from datasets import Dataset, DatasetDict

dataset = DatasetDict()
# using your `Dict` object: one Dataset per split
for k, v in Dict.items():
    dataset[k] = Dataset.from_dict(v)

GSA · April 26, 2021, 4:39pm

@lewtun,

Thanks for your help. It worked.

Also, as a follow-up question: how can I use the datasets.ClassLabel feature to specify a predefined set of classes that have labels associated with them and are stored as integers in the dataset?

According to the docs, this field is stored and retrieved as an integer value, and two conversion methods, datasets.ClassLabel.str2int() and datasets.ClassLabel.int2str(), can be used to convert from the label names to the associated integer values and vice versa.

lewtun · April 26, 2021, 9:01pm

great that it worked!

i’m not sure i understand the question (what exactly do you want to know?), but judging by your first comment it seems that you’re doing named entity recognition. you can see in the datasets library how ClassLabel is used for this task e.g. here https://github.com/huggingface/datasets/blob/8e903b5ef7c039cee79f0a2da3dd328d59c38588/datasets/conll2003/conll2003.py#L167

GSA · April 27, 2021, 2:52am

@lewtun,

Yes, you are correct, I’m trying to work on a named entity recognition task. So what I was hoping to achieve was to be able to do something like below:

Dict = {'train': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'ner_tags': train_tags(
                                        datasets.features.ClassLabel(
                                            names=['B-ORG', 'I-ORG', 'O', 'I-EVENT', 'I-PERSON', 'B-PERSON']
                                        )
                                      )
                },
       'val': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'ner_tags': train_tags(
                                        datasets.features.ClassLabel(
                                            names=['B-ORG', 'I-ORG', 'O', 'I-EVENT', 'I-PERSON', 'B-PERSON']
                                        )
                                      )
                },
        'test': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'ner_tags': train_tags(
                                        datasets.features.ClassLabel(
                                            names=['B-ORG', 'I-ORG', 'O', 'I-EVENT', 'I-PERSON', 'B-PERSON']
                                        )
                                      )
                }
       }

Specifically, I want to include the ClassLabel specification in the Python dictionary statement itself. Doing it that way does not work because it raises a "'list' object is not callable" error. Is there a way to work around that?

lewtun · April 27, 2021, 12:51pm

one way you can do this is by explicitly specifying the features argument in the Dataset.from_dict method (docs), e.g. assume we have a dict with two examples:

from datasets import Dataset, ClassLabel, Sequence, Features, Value

d = {'id': ['0', '1'], 'ner_tags': [[3, 0, 7, 0, 0, 0, 7, 0, 0], [1, 2]], 'tokens': [['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn']]}
# define number of tags and their names
tags = ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'])
# create dataset
ds = Dataset.from_dict(mapping=d, features=Features({"ner_tags":Sequence(tags), 'id': Value(dtype='string'), 'tokens': Sequence(feature=Value(dtype='string'))}))
# access ClassLabel feature - 0 returns tag "O", 1 returns "B-PER" etc
ds.features["ner_tags"].feature.int2str(0)

you can consult the docs to see what role Sequence and Value play (tl;dr we have to explicitly define the datatypes for the underlying Arrow table)

GSA · April 27, 2021, 10:04pm

Thanks. That was extremely helpful.
