ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`label` in this case) have excessive nesting (inputs type `list` where type `int` is expected)

Good day, everyone!

I am following a blog post to fine-tune a topic classification model and got stuck here. I've read the relevant discussions and tried the listed solutions, but my problem is still unresolved.

Here is my preprocess_function:

def preprocess_function(data):
    # Tokenize the text field, padding each batch to its longest sequence
    model_inputs = tokenizer(
        data["text"],
        padding="longest",
        truncation=True,
    )

    # Invert id2label so label strings map to integer ids
    label_map = {label: index for index, label in id2label.items()}
    labels = [label_map[l] for l in data["label"]]
    model_inputs["labels"] = labels
    return model_inputs

The result of preprocess_function() on test_dataset is okay, containing the desired fields and corresponding values. However, when I apply the function to a single example, I get an error. For instance,
train_dataset[0]: {'text': 'time in germany', 'label': 'Clock'}

tokenized_sample = preprocess_function(train_dataset[0])
print(tokenized_sample)

Then I got the following KeyError:

KeyError                                  Traceback (most recent call last)
<ipython-input-110-860026a66a82> in <cell line: 1>()
----> 1 tokenized_sample = preprocess_function(train_dataset[0])
      2 print(tokenized_sample)
      3 print(f"Length of tokenized IDs: {len(tokenized_sample.input_ids)}")
      4 print(f"Length of attention mask: {len(tokenized_sample.attention_mask)}")

1 frames
<ipython-input-105-2b50729ea506> in <listcomp>(.0)
      7 
      8     label_map = {label: index for index, label in id2label.items()}
----> 9     labels = [label_map[l] for l in examples["label"]]
     10     # labels = tokenizer(labels, padding="longest", truncation=True)
     11     model_inputs["labels"] = labels

KeyError: 'C'

I assume something is wrong with the data type, but I am not sure how to fix it. Because of this error, the subsequent training code fails as well. The code is below:

BATCH_SIZE = 8
NUM_PROCS = 4

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
# Training arguments
OUT_DIR = "output"
LR = 1e-5
EPOCHS = 3

training_args = TrainingArguments(
    output_dir=OUT_DIR,
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=3,
    report_to='tensorboard',
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=None,  # No need for data collator as padding is handled in preprocess_function
    compute_metrics=None,  
)
trainer.train()

Many thanks for your help!

I think that in the case of a single example, your data["label"] is the string 'Clock'. When the list comprehension iterates over it, the first key it looks up is 'C', which in turn raises the KeyError.
Now you may be able to solve it yourself.
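
For illustration, a minimal sketch that handles both cases, assuming the same label_map as in your function:

raw_label = data["label"]
if isinstance(raw_label, str):
    # Single example: look the string up directly, e.g. 'Clock' -> 3
    labels = label_map[raw_label]
else:
    # Batch: map each label string to its integer id
    labels = [label_map[l] for l in raw_label]
model_inputs["labels"] = labels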

Thank you for your response.

To preprocess the whole dataset, I use labels = [label_map[l] for l in data["label"]], and when checking the tokenized value for a single text I replace that line with labels = label_map[data["label"]]. Now the code works. Below is the value of train_dataset[1]:

{'text': 'set an alarm to 4',
 'label': 'Clock',
 'input_ids': [101, 2275, 2019, 8598, 2000, 1018, 102, 0, 0, 0, 0, 0, 0],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
 'labels': 3}

Therefore I assume the data has been tokenized correctly. But when I run trainer.train(), the ValueError occurs, which really confuses me.

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`label` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

I appreciate your further help.

Aah. Try looking at the output of the preprocessing function for, say, train_dataset[:8]. Then see if the error makes sense.
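
For example, a quick check (assuming train_dataset is a datasets.Dataset, so slicing returns a dict of lists):

batch = preprocess_function(train_dataset[:8])
# Each field should be a list of 8 entries; extra nesting or a stray
# string column will show up here.
print({key: len(value) for key, value in batch.items()})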

The inputs to the Trainer should be batched tensors of sequences (input_ids, attention_mask, etc.) and targets. Here you are supplying a dictionary of plain Python lists, and note that the error message points at the leftover `label` feature specifically. Modify the output of your preprocessing function accordingly.
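
One common pattern (a sketch, assuming train_dataset and test_dataset are datasets.Dataset objects with the column names from your post) is to apply the function with map and drop the raw string columns, so the collator only sees fields it can turn into tensors:

# Remove the raw 'text' and 'label' string columns after tokenizing;
# the leftover string 'label' column is what the collator cannot tensorize.
tokenized_train = train_dataset.map(
    preprocess_function, batched=True, remove_columns=["text", "label"]
)
tokenized_test = test_dataset.map(
    preprocess_function, batched=True, remove_columns=["text", "label"]
)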