Issues with Data Collator and Tokenizing with NER Datasets

I’m having some strange issues with the data collator and DataLoader. I get the following error: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. What is strange is that, as you can see from the code below, I’m both truncating and padding. I can see that when the dataset is returned from the tokenizer, the input ids are all the same length. However, when I check the input id lengths after they are loaded into the DataLoader, the lengths are variable. If I remove the collator and batch size arguments, everything works fine with the same code. I assume I’m doing something stupid with the data collator? But I’ve tried a couple of collators, datasets, models, and tokenizers and I have the same issue. Any thoughts?

from transformers import (
    DataCollatorWithPadding,
    DataCollatorForTokenClassification,
    AutoTokenizer,
    AutoModelForTokenClassification,
)

from datasets import load_dataset

from torch.utils.data import DataLoader

#Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# tokenizing function
def tokenize(data):
    # tokenize the pre-split words, truncating and padding so the input ids share a length
    tokenized_samples = tokenizer(
        data["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding=True,
    )
    return tokenized_samples

#load wikiann dataset
dataset = load_dataset("wikiann", "bn")["train"]

#tokenize dataset with padding and truncation
dataset_tokenized = dataset.map(tokenize, batched=True)

#remove extra columns
dataset_tokenized = dataset_tokenized.remove_columns(["langs", "spans", "tokens"])

#change tag columns to labels
dataset_tokenized = dataset_tokenized.rename_column("ner_tags", "labels")

#instantiate collator - note also tried this with DataCollatorWithPadding
collator = DataCollatorForTokenClassification(tokenizer)

#Instantiate PyTorch DataLoader
dl = DataLoader(dataset_tokenized, shuffle=True, collate_fn=collator, batch_size=2)
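
For reference, a quick way to see whether padding actually produced one consistent length (a rough sketch, not part of the original script):

# count the distinct input id lengths after tokenization;
# more than one value means examples were padded to different lengths
lengths = {len(ids) for ids in dataset_tokenized["input_ids"]}
print(lengths)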

I did figure out the issue. It turns out that setting padding=True is equivalent to padding='longest', which pads to the longest sequence in the batch the tokenizer sees, not to a length that is consistent across the whole dataset. Because dataset.map(batched=True) feeds the tokenizer one chunk of examples at a time, each chunk gets padded to its own longest sequence, hence the error when the collator tries to batch examples of different lengths. I’m not sure if I’m missing something, but it didn’t seem like I could control that batch size or pad to the longest sequence across the whole dataset. So instead I used padding='max_length', which pads every example to the same length across the dataset.
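
For anyone hitting the same thing, here is a minimal sketch of the revised tokenizing function; the max_length=512 value is just an assumption based on BERT’s input limit, not something fixed by the original code:

def tokenize(data):
    # pad every example to the same fixed length so batches are consistent across the dataset
    tokenized_samples = tokenizer(
        data["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=512,  # assumed value; BERT accepts at most 512 tokens
    )
    return tokenized_samples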
