Hi! I’m new to the Hugging Face libraries, and I’m trying to fine-tune a model for sequence classification on a cyberbullying dataset with 6 classes.
Originally, training wouldn’t run at all because my tokenization produced tensors with mismatched dimensions.
I started changing the code to get the tokenization right, and now I get this error:
“ValueError: too many values to unpack (expected 2)”
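From what I understand, this ValueError is Python’s generic tuple-unpacking error; this toy snippet (nothing to do with my actual data, just to show where the message comes from) raises the same one:

```python
# Toy reproduction of the message only (not my actual code):
try:
    a, b = (1, 2, 3)  # 3 values on the right, only 2 targets on the left
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)
```

So somewhere, something that is expected to yield exactly two values is yielding more.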
I think I most likely mishandled the dataset (which I downloaded from another source).
Here’s what I’ve done:
```python
from datasets import ClassLabel, load_dataset
from torch import nn
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

def lowerWords(phrase):
    return phrase.lower()

def tokenize_sample(example):
    return tokenizer(example["tweet_text"], padding="max_length",
                     truncation=True, max_length=512, return_tensors="pt")

train_path = "./data/train_data.csv"
validation_path = "./data/validation_data.csv"
test_path = "./data/test_data.csv"

labels_of_bullying = {"age": 0, "ethnicity": 1, "gender": 2,
                      "not_cyberbullying": 3, "other_cyberbullying": 4,
                      "religion": 5}
keys_of_bullying = list(labels_of_bullying.keys())
```
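As a quick sanity check of the mapping (plain Python, nothing Hugging Face specific), the id-to-name round trip works, since dicts keep insertion order:

```python
labels_of_bullying = {"age": 0, "ethnicity": 1, "gender": 2,
                      "not_cyberbullying": 3, "other_cyberbullying": 4,
                      "religion": 5}
keys_of_bullying = list(labels_of_bullying.keys())

# round-trip: name -> id -> name
assert keys_of_bullying[labels_of_bullying["gender"]] == "gender"
print(keys_of_bullying[3])  # not_cyberbullying
```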
```python
datasets = {"train": train_path, "validation": validation_path, "test": test_path}
dataset = load_dataset("csv", data_files=datasets, cache_dir="./data")
dataset = dataset.rename_column("cyberbullying_type", "labels")

train_dataset_label = ClassLabel(names=keys_of_bullying)
validation_dataset_label = ClassLabel(names=keys_of_bullying)
test_dataset_label = ClassLabel(names=keys_of_bullying)

# class_encode_column takes just the column name and builds its own
# ClassLabel, so the ClassLabel class must not be passed as an argument
dataset["train"] = dataset["train"].class_encode_column("labels")
dataset["validation"] = dataset["validation"].class_encode_column("labels")
dataset["test"] = dataset["test"].class_encode_column("labels")

dataset = dataset.map(lambda x: {"tweet_text": lowerWords(x["tweet_text"])})
dataset = dataset.map(tokenize_sample)
```
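While debugging, I noticed that `return_tensors="pt"` inside `.map()` seems to give every example an extra leading dimension of size 1. Here is a toy sketch with plain lists (illustrative values, no transformers needed) of the shape problem I think this creates:

```python
# Toy illustration: with return_tensors="pt" in the map function, each
# example's input_ids come back with shape (1, max_length) instead of
# (max_length,), so a batch of them has shape (batch, 1, max_length).
max_length = 4                        # small stand-in for 512
per_example = [[101, 7592, 102, 0]]   # one example, shape (1, 4)
batch = [per_example, per_example]    # two such examples collated together

print(len(batch), len(batch[0]), len(batch[0][0]))  # 2 1 4
```

If that is really the cause, I guess the fix would be dropping `return_tensors="pt"` from `tokenize_sample` and letting the data collator build the tensors, but I’m not sure.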
I’m using the “microsoft/MiniLM-L12-H384-uncased” checkpoint, and I’ve made a small change to the model’s head:
```python
checkpoint = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

model.classifier = nn.Sequential(
    nn.Linear(384, 6),
    nn.Softmax(dim=-1),
)
```
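One thing I’m unsure about with this head: as far as I know, Trainer’s default loss for single-label classification is cross-entropy, which expects raw logits and applies log-softmax internally, so the `nn.Softmax` layer would normalize twice. A tiny plain-Python sketch (illustrative values only) of what softmax already does:

```python
import math

# Softmax turns raw scores into probabilities that sum to 1;
# cross-entropy loss expects the raw scores, not these probabilities.
logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0]
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
print(round(sum(probs), 6))  # 1.0
```

So maybe I should just pass `num_labels=6` to `from_pretrained` and drop the Softmax, though I haven’t verified whether this is related to the error.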
I’m also attaching my Trainer setup:
```python
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")

train_args = TrainingArguments(
    output_dir="./trainingResults",
    num_train_epochs=2,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
)

trainer = Trainer(
    model,
    train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```
Please help! I’m really struggling with this.
Thanks for any help in advance!