Losing my target variable when encoding

Hello everyone,

I have created a DatasetDict containing my training and test datasets, which looks like this:

DatasetDict({
    train: Dataset({
        features: ['Score', 'Review'],
        num_rows: 3014
    })
    test: Dataset({
        features: ['Score', 'Review'],
        num_rows: 754
    })
})
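
For context, the split was built along these lines (the toy rows below are just a stand-in for my actual loading code):

import pandas as pd
from datasets import Dataset

# Stand-in for my real data loading (hypothetical toy rows)
df = pd.DataFrame({
    "Score": [0, 4, 2],
    "Review": ["Not good.", "Great product!", "It is okay."],
})

# train_test_split returns a DatasetDict with "train" and "test" splits
data_dict = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=42)
print(data_dict)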

I want to fine-tune a BERT model for sentiment analysis, so I encode my datasets:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
num_labels = 5  # scores range from 0 to 4

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["Review"], padding=True, truncation=True)

data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)

data_encoded.set_format("torch", columns=["input_ids", "attention_mask", "Score"])

After encoding and setting the format, my target variable “Score” is missing from the resulting tensors. Does anyone have an idea what the problem is here? Does the target variable have to be a specific format/datatype? I have 5 different labels (from 0 to 4).
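
In case it helps with diagnosing, this is how the column types can be inspected, together with a cast of “Score” to int64 that I would try next (just a sketch, assuming a non-numeric label type is why the column gets dropped):

from datasets import Value

# Compare column types before and after encoding
print(data_dict["train"].features)
print(data_encoded["train"].features)

# Assumption: set_format("torch") may drop columns it cannot convert,
# so cast the label to a plain integer type before formatting
data_encoded = data_encoded.cast_column("Score", Value("int64"))
data_encoded.set_format("torch", columns=["input_ids", "attention_mask", "Score"])
print(data_encoded["train"][0].keys())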

Cheers