Cannot encode/tokenize my Dataset Dictionary

marlon89 · August 18, 2021, 1:46pm

Hello everyone,

I am trying to fine-tune my sentiment analysis model. To do that, I split my pandas DataFrame (one column with reviews, one column with sentiment scores) into a train and a test DataFrame and converted both into a DatasetDict:

#Creating Dataset Objects
dataset_train = datasets.Dataset.from_pandas(training_data)
dataset_test = datasets.Dataset.from_pandas(testing_data)

#Get rid of weird columns
dataset_train = dataset_train.remove_columns('__index_level_0__')
dataset_test = dataset_test.remove_columns('__index_level_0__')

#Create Dataset Dictionary
data_dict = datasets.DatasetDict({"train":dataset_train,"test":dataset_test})

I am converting everything to a DatasetDict because I am more or less following an existing example and adapting it to my problem. Anyway, I define the function to tokenize:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score

num_labels = 5
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
batch_size = 16
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
#Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name)


def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)

and call the function with:

data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)

I am getting this error after all this:

ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

What am I missing? Sorry I am completely new to the whole Huggingface infrastructure…

marlon89 · August 19, 2021, 7:55am

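I found the error myself: I had to specify which column to tokenize. With batched=True, map passes each batch to the function as a dictionary of columns, not as a list of strings, so the tokenizer has to be given the text column. The correct tokenize function is: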

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

instead of

def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
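
For anyone who wants to see the difference without downloading a model, here is a minimal sketch in plain Python. Note the assumptions: toy_tokenizer is a hypothetical stand-in that only imitates the input type check of a real tokenizer, the "label" column name is made up for illustration, and the batch dict is written by hand to mimic what map passes in when batched=True.

```python
# Hypothetical stand-in for a Hugging Face tokenizer (NOT the real API):
# it accepts only a str or a list, imitating the type check behind the ValueError.
def toy_tokenizer(text, padding=True, truncation=True):
    if not isinstance(text, (str, list)):
        raise ValueError("text input must be str or List[str]")
    texts = [text] if isinstance(text, str) else text
    # Fake "input_ids": one integer (the word length) per whitespace-separated word.
    return {"input_ids": [[len(word) for word in t.split()] for t in texts]}


# Hand-written batch in the shape map(..., batched=True) uses:
# a dict mapping each column name to a list of values.
# "label" is an assumed column name for illustration.
batch = {"text": ["great product", "would not buy again"], "label": [4, 0]}


def tokenize_wrong(batch):
    return toy_tokenizer(batch)  # passes the whole dict


def tokenize_right(batch):
    return toy_tokenizer(batch["text"])  # passes only the list of strings


try:
    tokenize_wrong(batch)
    wrong_failed = False
except ValueError:
    wrong_failed = True

encoded = tokenize_right(batch)
print(wrong_failed, encoded["input_ids"])
```

The first version fails because it hands the tokenizer a dict, while the second picks out the list of strings first. With the real tokenizer the same reasoning should apply, using whatever your review column is actually called.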
