Preprocessing data for text classification, HF dataset

jorgenhw · October 3, 2022, 9:31am

I have this dataset:

text                        label
lorem ipsum          positive
lorem ipsum          positive
lorem ipsum          positive
lorem ipsum          positive

which I tokenize like this:

def tokenize_function(example):
  return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)

and now it looks like this:

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1024
})

Questions:
1: Should the label column be tokenized? The task is multiclass classification.
2: Should I remove the ‘text’ column from the dataset (which I use for training)?
3: Are there other things I haven’t thought about?

lhoestq · October 3, 2022, 4:23pm

Hi!

The “label” column usually needs to hold integers, since that is what gets passed to the model, so it is encoded rather than tokenized. You can use

tokenized_datasets = tokenized_datasets.class_encode_column("label")

to automatically convert the column to integers. The string<->integer mapping can then be found at tokenized_datasets.features["label"].

In general, models accept tokens as input (input_ids, token_type_ids, attention_mask), so you can drop the “text” column.
