I’m trying to train a number of transformer models on a classification task. My dataset has only two columns: text and label.
As part of the pre-processing, I tokenize, pad, and truncate the texts. The input for this function is a datasets.dataset_dict.DatasetDict object, where ‘checkpoint’ refers to the transformer model being trained and ‘dataset_dict’ refers to the pandas dataframe that is being tokenized.
The pre-processing is done with the following script.
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

def tokenizer_padding(dataset_dict, checkpoint, batch_size):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length')

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True)

    tokenized_dataset = dataset_dict.map(tokenize_function, batched=True)
    tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

    train_dataloader = DataLoader(
        tokenized_dataset["train"], shuffle=True, batch_size=batch_size, collate_fn=data_collator
    )
    validation_dataloader = DataLoader(
        tokenized_dataset["validation"], batch_size=batch_size, collate_fn=data_collator
    )
    test_dataloader = DataLoader(
        tokenized_dataset["test"], batch_size=batch_size, collate_fn=data_collator
    )
    return train_dataloader, validation_dataloader, test_dataloader
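For context, this is roughly how I call the function for the different models (the exact BERT/RoBERTa checkpoint names and the batch size here are just placeholders):

    # same call pattern for every model; only the checkpoint string changes
    train_dl, val_dl, test_dl = tokenizer_padding(dataset_dict, "bert-base-uncased", 16)          # works
    train_dl, val_dl, test_dl = tokenizer_padding(dataset_dict, "roberta-base", 16)               # works
    train_dl, val_dl, test_dl = tokenizer_padding(dataset_dict, "microsoft/deberta-v3-base", 16)  # fails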
When I train BERT and RoBERTa, the tokenizer_padding function works perfectly. However, when I use DeBERTa (microsoft/deberta-v3-base), I get the following error:
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
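To get past that, I installed sentencepiece (I assume the plain pip install is what the message is referring to):

    pip install sentencepiece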
With sentencepiece installed, I get this second error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
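For debugging, this is the kind of check I can run inside tokenizer_padding, right after the set_format call, to inspect a single tokenized example (just a diagnostic sketch, output omitted):

    # diagnostic only: look at one tokenized training example
    sample = tokenized_dataset["train"][0]
    print(type(sample["input_ids"]))  # tensor, or a nested list?
    print(len(sample["input_ids"]))   # sequence length of this example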
I have no idea why only the DeBERTa model fails. Can someone help me with that?