Preprocessing data for text classification, HF dataset

I have this dataset:

text                        label
lorem ipsum          positive
lorem ipsum          positive
lorem ipsum          positive
lorem ipsum          positive

which I tokenize like this:

def tokenize_function(example):
  return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

# assuming the raw dataset is stored in a variable named "dataset"
tokenized_datasets = dataset.map(tokenize_function, batched=True)

and now it looks like this:

    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1024

1: Should the label column be tokenized? The task is multiclass classification
2: Should I remove the ‘text’ column from the dataset (which I use for training)?
3: Other things I haven’t thought about?

Hi!

  1. No, the label should not be tokenized. The model usually expects the “label” column to hold integers, so you can use
tokenized_datasets = tokenized_datasets.class_encode_column("label")

to automatically convert the column to integers. The string<->integer mapping can then be found at tokenized_datasets.features["label"].

  2. In general, models accept tokens as input (input_ids, token_type_ids, attention_mask), so you can drop the “text” column.