I’m trying to train a number of transformer models on a classification task. My dataset has only two columns: text and label.
As part of the pre-processing, I tokenize, pad, and truncate the texts. The input for this function is a datasets.dataset_dict.DatasetDict object, where ‘checkpoint’ refers to the transformer model being trained and ‘dataset_dict’ refers to the pandas dataframe that is being tokenized.
The pre-processing is done with the following script.
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

def tokenizer_padding(dataset_dict, checkpoint, batch_size):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length')

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True)

    tokenized_dataset = dataset_dict.map(tokenize_function, batched=True)
    tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

    train_dataloader = DataLoader(
        tokenized_dataset["train"], shuffle=True, batch_size=batch_size, collate_fn=data_collator
    )
    validation_dataloader = DataLoader(
        tokenized_dataset["validation"], batch_size=batch_size, collate_fn=data_collator
    )
    test_dataloader = DataLoader(
        tokenized_dataset["test"], batch_size=batch_size, collate_fn=data_collator
    )
    return train_dataloader, validation_dataloader, test_dataloader
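For context, this is roughly how I call the function for the different models (the exact BERT/RoBERTa checkpoint names and the batch size here are just placeholders):

    # same call pattern for every model; only the checkpoint string changes
    train_dl, val_dl, test_dl = tokenizer_padding(dataset_dict, "bert-base-uncased", 16)          # works
    train_dl, val_dl, test_dl = tokenizer_padding(dataset_dict, "roberta-base", 16)               # works
    train_dl, val_dl, test_dl = tokenizer_padding(dataset_dict, "microsoft/deberta-v3-base", 16)  # fails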
When I train BERT and RoBERTa, the tokenizer_padding function works perfectly. However, when I use DeBERTa (microsoft/deberta-v3-base), I get the following error:
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
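To get past that, I installed sentencepiece (I assume the plain pip install is what the message is referring to):

    pip install sentencepiece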
With sentencepiece installed, I get this second error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
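For debugging, this is the kind of check I can run inside tokenizer_padding, right after the set_format call, to inspect a single tokenized example (just a diagnostic sketch, output omitted):

    # diagnostic only: look at one tokenized training example
    sample = tokenized_dataset["train"][0]
    print(type(sample["input_ids"]))  # tensor, or a nested list?
    print(len(sample["input_ids"]))   # sequence length of this example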
I have no idea why only the DeBERTa model fails. Can someone help me with that?