ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`label` in this case) have excessive nesting (inputs type `list` where type `int` is expected)

Good day, everyone!

I am following a blog post to fine-tune a topic classification model and got stuck here. I've read the relevant discussions and tried the listed solutions, but my problem is still unresolved.

Here is my preprocess_function:

def preprocess_function(data):
    # Tokenize the text field, padding each batch to its longest sequence
    model_inputs = tokenizer(
        data["text"],
        padding="longest",
        truncation=True,
    )

    # Invert id2label so label strings map to integer ids
    label_map = {label: index for index, label in id2label.items()}
    labels = [label_map[l] for l in data["label"]]
    model_inputs["labels"] = labels
    return model_inputs

The result of preprocess_function() on test_dataset is okay, containing the desired fields and corresponding values. However, when I apply the function to a single example, I get an error. For instance,
train_dataset[0]: {'text': 'time in germany', 'label': 'Clock'}

tokenized_sample = preprocess_function(train_dataset[0])
print(tokenized_sample)

Then I got the following KeyError:

KeyError                                  Traceback (most recent call last)
<ipython-input-110-860026a66a82> in <cell line: 1>()
----> 1 tokenized_sample = preprocess_function(train_dataset[0])
      2 print(tokenized_sample)
      3 print(f"Length of tokenized IDs: {len(tokenized_sample.input_ids)}")
      4 print(f"Length of attention mask: {len(tokenized_sample.attention_mask)}")

1 frames
<ipython-input-105-2b50729ea506> in <listcomp>(.0)
      7 
      8     label_map = {label: index for index, label in id2label.items()}
----> 9     labels = [label_map[l] for l in examples["label"]]
     10     # labels = tokenizer(labels, padding="longest", truncation=True)
     11     model_inputs["labels"] = labels

KeyError: 'C'

I assume something is wrong with the data type, but I am not sure how to fix it. Because of this error, the subsequent training code fails as well. The code is below:

BATCH_SIZE = 8
NUM_PROCS = 4

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
# Training arguments
OUT_DIR = "output"
LR = 1e-5
EPOCHS = 3

training_args = TrainingArguments(
    output_dir=OUT_DIR,
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=3,
    report_to='tensorboard',
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=None,  # No need for data collator as padding is handled in preprocess_function
    compute_metrics=None,  
)
trainer.train()

Many thanks for your help!

I think that in the case of a single example, your data["label"] is the string 'Clock'. When the list comprehension iterates over it, the first key it looks up is 'C', which in turn raises the KeyError.
Now you may be able to solve it yourself.
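
For illustration, a minimal sketch that handles both cases, assuming the same label_map as in your function:

raw_label = data["label"]
if isinstance(raw_label, str):
    # Single example: look the string up directly, e.g. 'Clock' -> 3
    labels = label_map[raw_label]
else:
    # Batch: map each label string to its integer id
    labels = [label_map[l] for l in raw_label]
model_inputs["labels"] = labels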

Thank you for your response.

To preprocess the whole dataset, I use labels = [label_map[l] for l in data["label"]], and when checking the tokenized value for a single text I replace that line with labels = label_map[data["label"]]. Now the code works. Below is the value of train_dataset[1]:

{'text': 'set an alarm to 4',
 'label': 'Clock',
 'input_ids': [101, 2275, 2019, 8598, 2000, 1018, 102, 0, 0, 0, 0, 0, 0],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
 'labels': 3}

Therefore I assume the data has been tokenized correctly. But when I run trainer.train(), the ValueError occurs, which really confuses me.

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`label` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

I appreciate your further help.

Aah. Try looking at the output of the preprocessing function for, say, train_dataset[:8]. Then see if the error makes sense.
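
For example, a quick check (assuming train_dataset is a datasets.Dataset, so slicing returns a dict of lists):

batch = preprocess_function(train_dataset[:8])
# Each field should be a list of 8 entries; extra nesting or a stray
# string column will show up here.
print({key: len(value) for key, value in batch.items()})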

The inputs to the Trainer should be batched tensors of sequences (input_ids, attention_mask, etc.) and targets. Here you are supplying a dictionary of plain Python lists, and note that the error message points at the leftover `label` feature specifically. Modify the output of your preprocessing function accordingly.
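
One common pattern (a sketch, assuming train_dataset and test_dataset are datasets.Dataset objects with the column names from your post) is to apply the function with map and drop the raw string columns, so the collator only sees fields it can turn into tensors:

# Remove the raw 'text' and 'label' string columns after tokenizing;
# the leftover string 'label' column is what the collator cannot tensorize.
tokenized_train = train_dataset.map(
    preprocess_function, batched=True, remove_columns=["text", "label"]
)
tokenized_test = test_dataset.map(
    preprocess_function, batched=True, remove_columns=["text", "label"]
)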