Hi! I’m new to the Hugging Face libraries, and I’m trying to fine-tune a model for sequence classification on a cyberbullying dataset with 6 classes.
Originally, training wouldn’t run at all because my tokenization produced tensors with mismatched dimensions.
I started changing the code to get the tokenization right, and now I get this error:
“ValueError: too many values to unpack (expected 2)”
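From what I understand, this ValueError is Python’s generic tuple-unpacking error; this toy snippet (nothing to do with my actual data, just to show where the message comes from) raises the same one:

```python
# Toy reproduction of the message only (not my actual code):
try:
    a, b = (1, 2, 3)  # 3 values on the right, only 2 targets on the left
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)
```

So somewhere, something that is expected to yield exactly two values is yielding more.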
I think I most likely mishandled the dataset (which I downloaded from another source).
Here’s what I’ve done:
```python
from datasets import ClassLabel, load_dataset
from torch import nn
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

def lowerWords(phrase):
    return phrase.lower()

def tokenize_sample(example):
    return tokenizer(example["tweet_text"], padding="max_length",
                     truncation=True, max_length=512, return_tensors="pt")

train_path = "./data/train_data.csv"
validation_path = "./data/validation_data.csv"
test_path = "./data/test_data.csv"

labels_of_bullying = {"age": 0, "ethnicity": 1, "gender": 2,
                      "not_cyberbullying": 3, "other_cyberbullying": 4,
                      "religion": 5}
keys_of_bullying = list(labels_of_bullying.keys())
```
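As a quick sanity check of the mapping (plain Python, nothing Hugging Face specific), the id-to-name round trip works, since dicts keep insertion order:

```python
labels_of_bullying = {"age": 0, "ethnicity": 1, "gender": 2,
                      "not_cyberbullying": 3, "other_cyberbullying": 4,
                      "religion": 5}
keys_of_bullying = list(labels_of_bullying.keys())

# round-trip: name -> id -> name
assert keys_of_bullying[labels_of_bullying["gender"]] == "gender"
print(keys_of_bullying[3])  # not_cyberbullying
```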
```python
datasets = {"train": train_path, "validation": validation_path, "test": test_path}
dataset = load_dataset("csv", data_files=datasets, cache_dir="./data")
dataset = dataset.rename_column("cyberbullying_type", "labels")

train_dataset_label = ClassLabel(names=keys_of_bullying)
validation_dataset_label = ClassLabel(names=keys_of_bullying)
test_dataset_label = ClassLabel(names=keys_of_bullying)

# class_encode_column takes just the column name and builds its own
# ClassLabel, so the ClassLabel class must not be passed as an argument
dataset["train"] = dataset["train"].class_encode_column("labels")
dataset["validation"] = dataset["validation"].class_encode_column("labels")
dataset["test"] = dataset["test"].class_encode_column("labels")

dataset = dataset.map(lambda x: {"tweet_text": lowerWords(x["tweet_text"])})
dataset = dataset.map(tokenize_sample)
```
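While debugging, I noticed that `return_tensors="pt"` inside `.map()` seems to give every example an extra leading dimension of size 1. Here is a toy sketch with plain lists (illustrative values, no transformers needed) of the shape problem I think this creates:

```python
# Toy illustration: with return_tensors="pt" in the map function, each
# example's input_ids come back with shape (1, max_length) instead of
# (max_length,), so a batch of them has shape (batch, 1, max_length).
max_length = 4                        # small stand-in for 512
per_example = [[101, 7592, 102, 0]]   # one example, shape (1, 4)
batch = [per_example, per_example]    # two such examples collated together

print(len(batch), len(batch[0]), len(batch[0][0]))  # 2 1 4
```

If that is really the cause, I guess the fix would be dropping `return_tensors="pt"` from `tokenize_sample` and letting the data collator build the tensors, but I’m not sure.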
I’m using the “microsoft/MiniLM-L12-H384-uncased” checkpoint, and I’ve made a small change to the model’s head:
```python
checkpoint = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

model.classifier = nn.Sequential(
    nn.Linear(384, 6),
    nn.Softmax(dim=-1),
)
```
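One thing I’m unsure about with this head: as far as I know, Trainer’s default loss for single-label classification is cross-entropy, which expects raw logits and applies log-softmax internally, so the `nn.Softmax` layer would normalize twice. A tiny plain-Python sketch (illustrative values only) of what softmax already does:

```python
import math

# Softmax turns raw scores into probabilities that sum to 1;
# cross-entropy loss expects the raw scores, not these probabilities.
logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0]
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
print(round(sum(probs), 6))  # 1.0
```

So maybe I should just pass `num_labels=6` to `from_pretrained` and drop the Softmax, though I haven’t verified whether this is related to the error.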
I’m also attaching my Trainer setup:
```python
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")

train_args = TrainingArguments(
    output_dir="./trainingResults",
    num_train_epochs=2,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
)

trainer = Trainer(
    model,
    train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```
Please help! I’m really struggling with this.
Thanks for any help in advance!