One way to do this is to pass the features argument explicitly to the Dataset.from_dict method (docs). For example, assume we have a dict with two examples:
from datasets import Dataset, ClassLabel, Sequence, Features, Value
d = {'id': ['0', '1'], 'ner_tags': [[3, 0, 7, 0, 0, 0, 7, 0, 0], [1, 2]], 'tokens': [['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn']]}
# define number of tags and their names
tags = ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'])
# create dataset
features = Features({
    'id': Value(dtype='string'),
    'tokens': Sequence(feature=Value(dtype='string')),
    'ner_tags': Sequence(tags),
})
ds = Dataset.from_dict(mapping=d, features=features)
# access ClassLabel feature - 0 returns tag "O", 1 returns "B-PER" etc
ds.features["ner_tags"].feature.int2str(0)
You can consult the docs to see what roles Sequence and Value play (tl;dr: we have to explicitly define the datatypes for the underlying Arrow table).