I’m using a custom dataset from a CSV file where the labels are strings. I’m curious what the best way to encode these labels to integers would be.
Sample code:
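from datasets import load_dataset

# Load the train/test splits from local CSV files.
datasets = load_dataset('csv', data_files={
    'train': 'train.csv',
    'test': 'test.csv',
})

# `tokenizer` is assumed to be defined earlier (e.g. a Hugging Face tokenizer).
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, max_length=128)

datasets = datasets.map(tokenize, batched=True)
# 'labels' still holds strings at this point; it must be integers to become a tensor.
datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])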
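
One possible approach, sketched here in plain Python, is to build a string-to-integer mapping from the training labels and apply it to each batch. The helper names `build_label2id` and `encode_labels` are hypothetical, the sample label values are made up for illustration, and the column is assumed to be called `labels` as in the code above.

```python
def build_label2id(labels):
    """Map each distinct string label to an integer id, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def encode_labels(batch, label2id):
    """Replace the string labels in a batch (dict of lists) with integer ids."""
    return {"labels": [label2id[label] for label in batch["labels"]]}


# Toy stand-in for the 'labels' column of the train split (made-up values).
train_labels = ["positive", "negative", "neutral", "negative"]
label2id = build_label2id(train_labels)
# Reverse mapping, handy for turning predictions back into strings.
id2label = {idx: label for label, idx in label2id.items()}

encoded = encode_labels({"labels": train_labels}, label2id)
```

With the real data, the mapping would presumably be built from `datasets['train']['labels']` and applied with something like `datasets.map(lambda batch: encode_labels(batch, label2id), batched=True)` before the `set_format` call. Building the mapping from the train split only means any label that appears solely in the test split raises a `KeyError`, which may or may not be the behavior you want. The `datasets` library also appears to offer `class_encode_column('labels')`, which converts a string column to a `ClassLabel` feature; that may be simpler if it is available in your version.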