How to Use a Nested Python Dictionary in Dataset.from_dict

I have a nested Python dictionary:

# create a nested dictionary
Dict = {'train': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'tags': train_tags},
        'val': {'id': np.arange(len(val_texts)),
                'tokens': val_texts,
                'tags': val_tags},
        'test': {'id': np.arange(len(test_texts)),
                 'tokens': test_texts,
                 'tags': test_tags}
        }

My question is: how do I use this nested dictionary with Dataset.from_dict() from the datasets library so that it gives me output like the following?

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 6801
    })
    val: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 1480
    })
    test: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 1532
    })
})

hey @GSA, as far as i know you can’t create a DatasetDict object directly from a python dict, but you could try creating 3 Dataset objects (one for each split) and then add them to DatasetDict as follows:

from datasets import Dataset, DatasetDict

dataset = DatasetDict()
# using your `Dict` object: build one Dataset per split
for k, v in Dict.items():
    dataset[k] = Dataset.from_dict(v)

@lewtun,

Thanks for your help. It worked.

Also, as a follow-up question: how can I use the datasets.ClassLabel feature to specify a predefined set of classes whose labels are stored as integers in the dataset?

According to the docs, this field is stored and retrieved as an integer value, and two conversion methods, datasets.ClassLabel.str2int() and datasets.ClassLabel.int2str(), can be used to convert from the label names to the associated integer values and vice versa.

great that it worked!

i’m not sure i understand the question (what exactly do you want to know?), but judging by your first comment it seems that you’re doing named entity recognition. you can see in the datasets library how ClassLabel is used for this task e.g. here https://github.com/huggingface/datasets/blob/8e903b5ef7c039cee79f0a2da3dd328d59c38588/datasets/conll2003/conll2003.py#L167

@lewtun,

Yes, you are correct, I’m trying to work on a named entity recognition task. So what I was hoping to achieve was to be able to do something like below:

Dict = {'train': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'ner_tags': train_tags(
                                        datasets.features.ClassLabel(
                                            names=['B-ORG', 'I-ORG', 'O', 'I-EVENT', 'I-PERSON', 'B-PERSON']
                                        )
                                      )
                },
       'val': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'ner_tags': train_tags(
                                        datasets.features.ClassLabel(
                                            names=['B-ORG', 'I-ORG', 'O', 'I-EVENT', 'I-PERSON', 'B-PERSON']
                                        )
                                      )
                },
        'test': {'id': np.arange(len(train_texts)),
                  'tokens': train_texts,
                  'ner_tags': train_tags(
                                        datasets.features.ClassLabel(
                                            names=['B-ORG', 'I-ORG', 'O', 'I-EVENT', 'I-PERSON', 'B-PERSON']
                                        )
                                      )
                }
       }

Specifically, I want to include the ClassLabel specification in the Python dictionary statement itself. Doing it that way does not work because it gives a "'list' object is not callable" error. Is there a way to work around that?
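
The error most likely comes from the `train_tags(...)` syntax: putting parentheses after a list tries to call it like a function. A small sketch that reproduces this in plain Python, with a hypothetical toy `train_tags` list:

```python
# Hypothetical toy tags; in the post above, train_tags is a list as well.
train_tags = [["B-ORG", "O"], ["B-PERSON", "I-PERSON"]]

try:
    # Parentheses after a list attempt a function call, which lists do not support.
    train_tags("anything")
except TypeError as err:
    message = str(err)

print(message)  # 'list' object is not callable
```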

one way you can do this is by explicitly specifying the features argument in the Dataset.from_dict method (docs), e.g. assume we have a dict with two examples:

from datasets import Dataset, ClassLabel, Sequence, Features, Value

d = {'id': ['0', '1'], 'ner_tags': [[3, 0, 7, 0, 0, 0, 7, 0, 0], [1, 2]], 'tokens': [['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn']]}
# define number of tags and their names
tags = ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'])
# create dataset
ds = Dataset.from_dict(mapping=d, features=Features({"ner_tags":Sequence(tags), 'id': Value(dtype='string'), 'tokens': Sequence(feature=Value(dtype='string'))}))
# access ClassLabel feature - 0 returns tag "O", 1 returns "B-PER" etc
ds.features["ner_tags"].feature.int2str(0)

you can consult the docs to see what role Sequence and Value play (tl;dr we have to explicitly define the datatypes for the underlying Arrow table)


Thanks. That was extremely helpful.
