Hi everyone. I’m working on a sequence labelling task. During training, the F1 scores on the validation set are abnormally high, but when I try inference the model barely gets anything right. I suspect it may be about how I built the dataset. I don’t get any errors, by the way.
I create the datasets with the from_list() function as follows:
train_dataset = Dataset.from_list(train_l)
valid_dataset = Dataset.from_list(valid_l)
test_dataset = Dataset.from_list(test_l)
where each element of train_l/valid_l/test_l is a dictionary of the form:
{"tokens": [...], "tags": [...]}
After this I simply tokenize each dataset and align the tags with the new tokens. When I run
print(tokenized_train.features)
the output is:
{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
'tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
where labels are the label IDs aligned with the byte-pair tokenization, including the special -100 values, and tags are simply the original word-level tags without alignment.
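For reference, my alignment logic is roughly the following (a minimal, self-contained sketch of the common word_ids() convention, not my exact code — function and variable names are made up):

```python
def align_labels(word_ids, tags):
    """Map word-level tags onto subword tokens.

    word_ids: what the fast tokenizer's word_ids() returns -- None for
    special tokens ([CLS], [SEP], padding), otherwise the index of the
    original word a subword came from.
    Convention used here: only the first subword of each word keeps the
    word's tag; special tokens and continuation subwords get -100 so the
    loss ignores them.
    """
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)        # special token
        elif wid != previous:
            labels.append(tags[wid])   # first subword of a word
        else:
            labels.append(-100)        # continuation subword
        previous = wid
    return labels

# e.g. word_ids for "[CLS] play ##ing field [SEP]" over words
# ["playing", "field"] with word-level tags [2, 0]:
print(align_labels([None, 0, 0, 1, None], [2, 0]))
# -> [-100, 2, -100, 0, -100]
```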
I think there’s something wrong here. The documentation says these should be ClassLabel so that the Trainer can understand which values are the targets. On the other hand, the documentation on loading datasets from a local machine doesn’t mention anything about it.
Can you spot anything wrong here? Should I set ClassLabel, or do something like that?