Correct use of dataset.class_encode_column

I have a dataset consisting of two fields (“text” and “label”) and two splits (“train” and “test”). Both text and label are of type string.

I need to encode the labels. I have a large number of classes and need to discover them at train time. The following code works fine:

from itertools import chain
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the labels from both splits so the mapping is shared
label_encoder = LabelEncoder()
labels = list(chain(dataset['train']['label'], dataset['test']['label']))
label_encoder.fit(labels)
num_classes = len(label_encoder.classes_)
id2label = {id: label
            for id, label in enumerate(label_encoder.classes_.tolist())}
label2id = {label: id
            for (id, label) in id2label.items()}

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    tokens['label'] = [label2id[label] for label in batch['label']]

    return tokens
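To illustrate just the label-remapping step inside `tokenize`, here is a toy, self-contained sketch with the tokenizer call stubbed out (the label names below are made up for the example):

```python
# Hypothetical label vocabulary standing in for label_encoder.classes_
label2id = {"negative": 0, "neutral": 1, "positive": 2}

def encode_labels(batch):
    # Map each string label in the batch to its integer id,
    # exactly as the list comprehension in tokenize() does
    return {"label": [label2id[label] for label in batch["label"]]}

batch = {"text": ["great!", "meh"], "label": ["positive", "neutral"]}
print(encode_labels(batch))  # {'label': [2, 1]}
```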

I’m aware that datasets has built-in functionality for “categorical” columns, but I cannot get it to work. I suspect it’s assigning different ids to the same label in the train and test sets. My code is:

dataset = dataset.class_encode_column("label")
num_classes = dataset['train'].features['label'].num_classes
id2label = {id: dataset['train'].features['label'].int2str(id)
            for id in range(num_classes)}
label2id = {label: id for (id, label) in id2label.items()}

With this code I removed the label encoding from my tokenize function, i.e.

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    return tokens

In both cases I initialise the model as follows:

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes, id2label=id2label, label2id=label2id
)

In the first case I get 73% accuracy after epoch 1 (evaluated on the test set). When I use class_encode_column I get 2%. If I train and evaluate using only the test set I get much higher accuracy, which suggests that either class_encode_column is encoding the test and train splits independently, or I’m doing something else wrong.
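The “encoding each split independently” hypothesis can be sketched in pure Python. This is a simplification that assumes the encoder sorts the unique labels of *one* split and assigns ids by position, so a split that is missing some label gets shifted ids (the label names are invented for the example):

```python
def encode_split(labels):
    # Sort the unique labels in this split and assign ids by position,
    # mimicking a per-split categorical encoding
    classes = sorted(set(labels))
    return {label: i for i, label in enumerate(classes)}

train = ["cat", "dog", "fish"]
test = ["dog", "fish"]  # "cat" never appears in the test split

print(encode_split(train))  # {'cat': 0, 'dog': 1, 'fish': 2}
print(encode_split(test))   # {'dog': 0, 'fish': 1} -- ids shifted!
```

With shifted ids, the model is trained against one mapping and evaluated against another, which would explain near-random accuracy on the test set.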

Can you please advise how to properly create a ClassLabel for a dataset that is already split? And when training a model with a ClassLabel as the target, are there any additional considerations?


Hi! For all the splits to have the same labels you can do:

dataset["train"] = dataset["train"].class_encode_column("label")
class_label_feature = dataset["train"].features["label"]

dataset["test"] = dataset["test"].cast_column("label", class_label_feature)
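Conceptually, this fix builds the mapping once from the train split and then reuses it for the test split, rather than re-deriving it per split. A minimal pure-Python sketch of that idea (label names invented for the example):

```python
# Made-up labels standing in for the two splits
train_labels = ["cat", "dog", "fish"]
test_labels = ["dog", "fish"]

# class_encode_column on train: derive the mapping from train only
classes = sorted(set(train_labels))
label2id = {label: i for i, label in enumerate(classes)}

# cast_column on test: reuse the *train* mapping instead of re-deriving it
test_ids = [label2id[label] for label in test_labels]
print(test_ids)  # [1, 2]
```

One caveat with this approach: if the test split contains a label that never appears in the train split, the cast will fail, so make sure every class is represented in train.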