I have this dataset:
text         label
lorem ipsum  positive
lorem ipsum  positive
lorem ipsum  positive
lorem ipsum  positive
which I tokenize like this:
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)
and now it looks like this:
Dataset({
features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1024
})
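As I understand it, `map(batched=True)` just merges the tokenizer's output columns into the dataset next to the original ones. A self-contained sketch of that merging, using a stand-in tokenizer so it runs without downloading a checkpoint (the real `tokenizer` in my code is a pretrained Hugging Face fast tokenizer; the stand-in only mimics its output keys):

```python
# Stand-in for the pretrained tokenizer: same call signature and output
# keys, dummy values. Only here to make the sketch runnable offline.
def fake_tokenizer(texts, padding=None, truncation=None, max_length=128):
    return {
        "input_ids": [[0] * max_length for _ in texts],
        "token_type_ids": [[0] * max_length for _ in texts],
        "attention_mask": [[1] * max_length for _ in texts],
    }

def tokenize_function(example):
    return fake_tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

# datasets.Dataset.map(batched=True) passes column-wise batches as a dict
# like this, then adds the returned columns alongside 'text' and 'label'.
batch = {"text": ["lorem ipsum", "lorem ipsum"], "label": ["positive", "positive"]}
merged = {**batch, **tokenize_function(batch)}
print(sorted(merged))
# ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids']
```

which matches the feature list printed above.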
Questions:
1: Should the label column be tokenized? The task is multiclass classification.
2: Should I remove the "text" column from the dataset (which I use for training)?
3: Are there other things I haven't thought about?