How to Use a Nested Python Dictionary in Dataset.from_dict

lewtun · April 27, 2021, 12:51pm

One way to do this is to pass the features argument explicitly to the Dataset.from_dict method (docs). For example, assume we have a dict with two examples:

from datasets import Dataset, ClassLabel, Sequence, Features, Value

d = {
    'id': ['0', '1'],
    'ner_tags': [[3, 0, 7, 0, 0, 0, 7, 0, 0], [1, 2]],
    'tokens': [['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn']],
}
# define number of tags and their names
tags = ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'])
# create dataset, declaring the type of every column (including the nested ones)
features = Features({
    'ner_tags': Sequence(tags),
    'id': Value(dtype='string'),
    'tokens': Sequence(feature=Value(dtype='string')),
})
ds = Dataset.from_dict(mapping=d, features=features)
# access ClassLabel feature - 0 returns tag "O", 1 returns "B-PER" etc
ds.features["ner_tags"].feature.int2str(0)

You can consult the docs to see what roles Sequence and Value play (tl;dr: we have to explicitly define the data types for the underlying Arrow table).
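If your raw data stores the tags as strings rather than integer ids, one option is to convert them yourself before building the dict. The snippet below is only a plain-Python sketch of that idea, assuming the same nine tag names as above; `tag_names`, `str2int`, `encode_tags` and `raw_tags` are hypothetical names for illustration, not part of the datasets API.

```python
# Tag names in the same order as the ClassLabel above, so list index == class id.
tag_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Hypothetical lookup table from tag name to integer id.
str2int = {name: idx for idx, name in enumerate(tag_names)}


def encode_tags(tag_sequences):
    """Convert a list of string-tag sequences into lists of integer ids."""
    return [[str2int[tag] for tag in seq] for seq in tag_sequences]


# Hypothetical raw input: one list of string tags per example.
raw_tags = [
    ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'],
    ['B-PER', 'I-PER'],
]

encoded = encode_tags(raw_tags)
print(encoded)
```

The resulting nested list has the same shape as the 'ner_tags' entry in the dict above, so it could be used in its place. An unknown tag name raises a KeyError here, which is usually what you want when the label set is fixed.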
