Correct use of dataset.class_encode_column

I have a dataset consisting of two fields (“text” and “label”) and two splits (“train” and “test”). Both text and label are of type string.

I need to encode the labels. I have a large number of classes and need to discover them at train time. The following code works fine:

from itertools import chain
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the labels from both splits so the mapping is shared
label_encoder = LabelEncoder()
labels = list(chain(dataset['train']['label'], dataset['test']['label']))
label_encoder.fit(labels)
num_classes = len(label_encoder.classes_)
id2label = {id: label
            for id, label in enumerate(label_encoder.classes_.tolist())}
label2id = {label: id
            for (id, label) in id2label.items()}

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    tokens['label'] = [label2id[label] for label in batch['label']]

    return tokens
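To illustrate just the label-remapping step inside `tokenize`, here is a toy, self-contained sketch with the tokenizer call stubbed out (the label names below are made up for the example):

```python
# Hypothetical label vocabulary standing in for label_encoder.classes_
label2id = {"negative": 0, "neutral": 1, "positive": 2}

def encode_labels(batch):
    # Map each string label in the batch to its integer id,
    # exactly as the list comprehension in tokenize() does
    return {"label": [label2id[label] for label in batch["label"]]}

batch = {"text": ["great!", "meh"], "label": ["positive", "neutral"]}
print(encode_labels(batch))  # {'label': [2, 1]}
```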

I’m aware that datasets has built-in functionality for “categorical” columns, but I cannot get it to work. I suspect it’s assigning different ids to the same label in the train and test sets. My code is:

dataset = dataset.class_encode_column("label")
num_classes = dataset['train'].features['label'].num_classes
id2label = {id: dataset['train'].features['label'].int2str(id)
            for id in range(num_classes)}
label2id = {label: id for (id, label) in id2label.items()}

With this code I removed the label encoding from my tokenize function, i.e.

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    return tokens

In both cases I initialise the model as follows:

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes, id2label=id2label, label2id=label2id
)

In the first case I get 73% accuracy after epoch 1 (evaluated on the test set). When I use class_encode_column I get 2%. If I train and evaluate using only the test set I get much higher accuracy, which suggests that either class_encode_column is encoding the test and train splits independently, or I’m doing something else wrong.
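The “encoding each split independently” hypothesis can be sketched in pure Python. This is a simplification that assumes the encoder sorts the unique labels of *one* split and assigns ids by position, so a split that is missing some label gets shifted ids (the label names are invented for the example):

```python
def encode_split(labels):
    # Sort the unique labels in this split and assign ids by position,
    # mimicking a per-split categorical encoding
    classes = sorted(set(labels))
    return {label: i for i, label in enumerate(classes)}

train = ["cat", "dog", "fish"]
test = ["dog", "fish"]  # "cat" never appears in the test split

print(encode_split(train))  # {'cat': 0, 'dog': 1, 'fish': 2}
print(encode_split(test))   # {'dog': 0, 'fish': 1} -- ids shifted!
```

With shifted ids, the model is trained against one mapping and evaluated against another, which would explain near-random accuracy on the test set.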

Can you please advise how to properly create a ClassLabel for a dataset that is already split? And when training a model with a ClassLabel as the target, are there any additional considerations?


Hi! For all the splits to have the same labels you can do:

dataset["train"] = dataset["train"].class_encode_column("label")
class_label_feature = dataset["train"].features["label"]

dataset["test"] = dataset["test"].cast_column("label", class_label_feature)
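Conceptually, this fix builds the mapping once from the train split and then reuses it for the test split, rather than re-deriving it per split. A minimal pure-Python sketch of that idea (label names invented for the example):

```python
# Made-up labels standing in for the two splits
train_labels = ["cat", "dog", "fish"]
test_labels = ["dog", "fish"]

# class_encode_column on train: derive the mapping from train only
classes = sorted(set(train_labels))
label2id = {label: i for i, label in enumerate(classes)}

# cast_column on test: reuse the *train* mapping instead of re-deriving it
test_ids = [label2id[label] for label in test_labels]
print(test_ids)  # [1, 2]
```

One caveat with this approach: if the test split contains a label that never appears in the train split, the cast will fail, so make sure every class is represented in train.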