Hello everyone,
I am trying to fine-tune my sentiment analysis model. To do that, I split my pandas DataFrame (one column with reviews, one column with sentiment scores) into a train and a test DataFrame and converted both into a DatasetDict:
import datasets
# Create Dataset objects
dataset_train = datasets.Dataset.from_pandas(training_data)
dataset_test = datasets.Dataset.from_pandas(testing_data)
# Drop the pandas index column carried over by from_pandas
dataset_train = dataset_train.remove_columns('__index_level_0__')
dataset_test = dataset_test.remove_columns('__index_level_0__')
# Create the DatasetDict
data_dict = datasets.DatasetDict({"train": dataset_train, "test": dataset_test})
I am converting everything to a DatasetDict because I am more or less following an existing code example and adapting it to my problem. Anyway, I then load the model and define the function to tokenize:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score
num_labels = 5
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
batch_size = 16
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
# The tokenizer used below; assumed to come from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
and call the function with:
data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)
I am getting this error after all this:
ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
What am I missing? Sorry, I am completely new to the whole Hugging Face ecosystem…
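
A guess at the cause, as a minimal sketch rather than a confirmed fix: with `batched=True`, `map` passes the function a mapping from column names to lists of values, not a list of strings, so `tokenizer(batch)` receives the whole mapping. The sketch below reproduces that situation without downloading anything. `fake_tokenizer`, `make_tokenize`, and the column name `"review"` are all hypothetical stand-ins; the real tokenizer and the real name of the reviews column would go in their place.

```python
from typing import Callable, Dict, List

# Hypothetical column name; replace with the real name of the reviews column.
TEXT_COLUMN = "review"


def fake_tokenizer(text, padding=True, truncation=True) -> Dict[str, List[List[int]]]:
    """Stand-in for a real tokenizer: accepts only str or List[str].

    The `truncation` argument is accepted for signature parity and ignored.
    """
    if isinstance(text, str):
        text = [text]
    if not (isinstance(text, list) and all(isinstance(t, str) for t in text)):
        raise ValueError("text input must be str or List[str]")
    # Toy "token ids": the length of each whitespace-separated word.
    ids = [[len(word) for word in t.split()] for t in text]
    if padding:
        # Pad every sequence with zeros to the longest one in the batch.
        width = max((len(seq) for seq in ids), default=0)
        ids = [seq + [0] * (width - len(seq)) for seq in ids]
    return {"input_ids": ids}


def make_tokenize(tokenizer: Callable, text_column: str) -> Callable:
    """Build a map-compatible function that tokenizes a single text column."""
    def tokenize(batch):
        # batch maps column name -> list of values, so select the text column.
        return tokenizer(batch[text_column], padding=True, truncation=True)
    return tokenize


# The shape of what a batched map call hands to the function: a dict of columns.
batch = {TEXT_COLUMN: ["great product", "bad"], "label": [4, 0]}

# Passing the whole batch mirrors tokenizer(batch) and is rejected.
try:
    fake_tokenizer(batch, padding=True, truncation=True)
    failed = False
except ValueError:
    failed = True

# Selecting the text column first gives the tokenizer a List[str].
encoded = make_tokenize(fake_tokenizer, TEXT_COLUMN)(batch)
```

If this is indeed the cause, the corresponding change in the function above would be to call the tokenizer on `batch["review"]` (again, with the actual column name) instead of on `batch`.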