How do I create new labels for a dataset?

Stationaryone · June 14, 2022, 2:57pm

I’m trying to fine-tune a model to do sentiment analysis using Keras/TensorFlow. I followed the exact code in Google Colab. However, instead of star ratings, I want only three sentiment labels: “positive”, “negative”, and “neutral” (1, -1, and 0, respectively). So, during tokenization, I mapped the star rating to a new “sentiment” field:

def tokenize_function(examples):
    examples['sentiment'] = []
    for x in examples['label']:
        if x > 3:
            examples['sentiment'].append(1)
        elif x < 3:
            examples['sentiment'].append(-1)
        else:
            examples['sentiment'].append(0)
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

However, I don’t know where or how to tell tokenized_datasets to use the new “sentiment” field as the labels. Maybe the DataCollator is used for that? Either way, I can’t find any documentation on how to do it.
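One possible approach, sketched below under a few assumptions: the label column is usually picked up by its name ("label" or "labels") rather than through the data collator, so the mapping function can simply overwrite "label" in the dict it returns instead of creating a separate "sentiment" field. The sketch also assumes the loss expects class ids from 0 to num_labels - 1, so it uses 0/1/2 in place of -1/0/1. The names star_to_sentiment, make_tokenize_function, and ID2LABEL are hypothetical helpers, not part of any library, and the thresholds are copied from the post (if the dataset stores stars as 0 to 4 rather than 1 to 5, they would need shifting).

```python
from typing import Callable, Dict, List

# Assumption: class ids run 0..2 (negative/neutral/positive) instead of -1/0/1,
# because sparse cross-entropy style losses typically reject negative ids.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def star_to_sentiment(star: int) -> int:
    """Map a star rating to a class id using the thresholds from the post."""
    if star > 3:
        return 2  # positive
    if star < 3:
        return 0  # negative
    return 1      # neutral


def make_tokenize_function(tokenizer: Callable[..., Dict[str, List]]):
    """Build a batched mapping function around any tokenizer-like callable."""

    def tokenize_function(examples):
        # Tokenize the batch of texts exactly as in the original function.
        encoded = dict(
            tokenizer(examples["text"], padding="max_length", truncation=True)
        )
        # Return the new labels under the "label" key so they replace the
        # star ratings in the mapped dataset.
        encoded["label"] = [star_to_sentiment(x) for x in examples["label"]]
        return encoded

    return tokenize_function
```

With this in place, the mapping call from the post would become tokenized_datasets = dataset.map(make_tokenize_function(tokenizer), batched=True), and the classification model would need to be loaded with num_labels=3 to match the three classes.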
