Dataset Object without ClassLabel

Hi everyone. I’m working on a sequence labelling task. During training, the F1 scores on the validation set are abnormally high, but when I try inference the model barely gets anything right. I thought it might be related to how I set up the dataset. I don’t get any errors, by the way.

I create the datasets with the from_list() function as follows:

train_dataset = Dataset.from_list(train_l)
valid_dataset = Dataset.from_list(valid_l)
test_dataset = Dataset.from_list(test_l)

where each element of train_l, valid_l and test_l is a dictionary of the following form:
{"tokens": [...], "tags":[...]}

After this I simply tokenize each dataset and align the tags with the new tokens. When I run
print(tokenized_train.features)
it outputs:
{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
'tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

Here, labels are the tags aligned with the byte-pair tokenization, including the special value -100. Tags are simply the original per-token tags, without alignment.
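For readers following along, here is a minimal sketch of what such an alignment step often looks like. It is not the poster's code. It assumes the word indices come from a fast tokenizer's word_ids() output (written out by hand here), and it uses one common convention: only the first sub-token of each word keeps the tag, and everything else gets -100. The name align_labels and the sample values are made up for illustration.

```python
IGNORE_INDEX = -100  # the special value mentioned above


def align_labels(word_ids, tags):
    """Map word-level tags onto sub-tokens (hypothetical helper).

    word_ids: one entry per sub-token; None for special tokens,
              otherwise the index of the word the sub-token belongs to.
    tags:     one integer tag per original word.
    """
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] / [SEP] carry no tag.
            labels.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word keeps the word's tag.
            labels.append(tags[word_id])
        else:
            # Later sub-tokens of the same word are masked out.
            labels.append(IGNORE_INDEX)
        previous = word_id
    return labels


# Hand-written example: three words, the second split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
tags = [3, 0, 5]
aligned = align_labels(word_ids, tags)
print(aligned)
```

If inference looks much worse than validation, it may be worth checking that the same masking convention is applied when predictions are mapped back to words.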

I think there’s something wrong here. The documentation says these should be ClassLabel so that the trainer can tell which values are the target values. On the other hand, the documentation on loading datasets from a local machine doesn’t mention anything about it.

Can you spot anything wrong here? Should I set ClassLabel or do anything like that?

I have the same issue. Maybe I’m looking in the wrong places, but I can’t find much in the guide about casting a feature to ClassLabel when making your own dataset… In my case, I have string IOB labels to transform to ints.

EDIT: this clued me in to the solution: How to create custom ClassLabels?

I solved it. What I did is basically:

dataset.features["your_label"] = Sequence(feature=ClassLabel(names=names_you_want))

However, I’m not sure whether it had any effect on training.


To cast a feature to a sequence of ClassLabel you can use .cast_column():

ds = ds.cast_column("labels", Sequence(ClassLabel(names=label_names)))