DeBERTa - ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length

I’m trying to train a number of transformer models on a classification task. My dataset has only two columns: text and label.

As part of the pre-processing, I tokenize, pad, and truncate the texts. The input to this function is a datasets.dataset_dict.DatasetDict object, where 'checkpoint' refers to the transformer model being trained and 'dataset_dict' refers to the pandas dataframe, converted to a DatasetDict, that is being tokenized.

The pre-processing is done with the following script.

from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

def tokenizer_padding(dataset_dict, checkpoint, batch_size):
    # Tokenizer for the given checkpoint and a collator that pads every batch
    # to the model's maximum length
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length')

    # Tokenize with truncation; padding is left to the collator
    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True)

    tokenized_dataset = dataset_dict.map(tokenize_function, batched=True)
    tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

    train_dataloader = DataLoader(
        tokenized_dataset["train"], shuffle=True, batch_size=batch_size, collate_fn=data_collator
    )
    validation_dataloader = DataLoader(
        tokenized_dataset["validation"], batch_size=batch_size, collate_fn=data_collator
    )
    test_dataloader = DataLoader(
        tokenized_dataset["test"], batch_size=batch_size, collate_fn=data_collator
    )

    return train_dataloader, validation_dataloader, test_dataloader

When I train BERT and RoBERTa, the tokenizer_padding function works perfectly. However, when I use DeBERTa (microsoft/deberta-v3-base), I get the following error:

ValueError: Couldn’t instantiate the backend tokenizer from one of:
(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

With sentencepiece installed, I get this second error:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (input_ids in this case) have excessive nesting (inputs type list where type int is expected).

I have no idea why only the DeBERTa model fails. Can someone help me with that?

I hope your issue is resolved by now. If not, you could try some of these steps:

  1. When loading the tokenizer, you could pass use_fast=False to use the slow tokenizer instead of the fast one, e.g. tokenizer = AutoTokenizer.from_pretrained(DEBERTA_MODEL, use_fast=False)

  2. If you want to use the fast tokenizer, install sentencepiece and then restart the kernel for the update to take effect.

  3. If the ValueError: Unable to create tensor error still arises, you could pass padding='max_length' or padding='longest' instead of padding=True (see the sketch below).
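
For reference, a minimal sketch combining suggestions 1 and 3; the checkpoint name is taken from the post above, and the rest is an illustration rather than a verified fix:

from transformers import AutoTokenizer

checkpoint = "microsoft/deberta-v3-base"

# Suggestion 1: fall back to the slow (SentencePiece-based) tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

# Suggestion 3: be explicit about the padding strategy when tokenizing
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="longest")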


Thanks, @Sandy1857. After some head-scratching, the last suggestion did the trick. It appears that DeBERTa processes inputs differently, requiring me to tweak the padding argument. When I set it to padding='longest', it worked fine.
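
For anyone hitting the same thing: the padding strategy can be set either on the collator or in the tokenizer call. A minimal sketch, assuming the tokenizer_padding function from my first post:

# pad each batch to its longest sequence instead of the model's max length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="longest")

# alternatively, pass the padding strategy in the tokenizer call
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="longest")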
