Cannot encode/tokenize my Dataset Dictionary

Hello everyone,

I am trying to fine-tune my sentiment analysis model. To do so, I split my pandas DataFrame (one column with reviews, one column with sentiment scores) into a train and a test DataFrame and transformed everything into a DatasetDict:

import datasets

# Creating Dataset objects
dataset_train = datasets.Dataset.from_pandas(training_data)
dataset_test = datasets.Dataset.from_pandas(testing_data)

# Get rid of the leftover pandas index column
dataset_train = dataset_train.remove_columns('__index_level_0__')
dataset_test = dataset_test.remove_columns('__index_level_0__')

# Create the DatasetDict
data_dict = datasets.DatasetDict({"train": dataset_train, "test": dataset_test})
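
(As an aside, here is a minimal sketch of a shorter route to the same DatasetDict, using a toy df as a stand-in for the real review DataFrame: from_pandas accepts preserve_index=False, and train_test_split returns a DatasetDict directly.)

import datasets
import pandas as pd

# Toy stand-in for the real review DataFrame (placeholder data).
df = pd.DataFrame({
    "text": ["Great product!", "Terrible service.", "Okay overall."],
    "label": [4, 0, 2],
})

# preserve_index=False keeps the pandas index out of the Dataset,
# so the __index_level_0__ column never appears.
dataset = datasets.Dataset.from_pandas(df, preserve_index=False)

# train_test_split returns a DatasetDict with "train" and "test" splits.
data_dict = dataset.train_test_split(test_size=0.2, seed=42)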

I am converting everything to a DatasetDict because I am more or less following existing example code and adapting it to my problem. Anyway, I define the tokenize function like this:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score

num_labels = 5
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
batch_size = 16
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)

and call the function with:

data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)

After all this, I get the following error:

ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

What am I missing? Sorry, I am completely new to the whole Hugging Face ecosystem…

I found the error on my own: I had to specify the column that needs to be tokenized. The correct tokenize function is:

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

instead of

def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
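
For anyone who hits the same error: with batched=True, map hands the function a dictionary mapping column names to lists of values, so batch itself is a dict, not the texts. Indexing it with the column name yields a List[str], which the tokenizer accepts. A minimal end-to-end sketch, assuming the text column is called "text" as above:

from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # With batched=True, batch is a dict like {"text": [...], "label": [...]};
    # batch["text"] is a List[str], which the tokenizer accepts.
    return tokenizer(batch["text"], padding=True, truncation=True)

# data_dict is the DatasetDict built earlier.
data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)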