Issues with Data Collator and Tokenizing with NER Datasets

I’m having some strange issues with the data collator and DataLoader. I get the following error: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. What is strange is that, as you can see from the code below, I’m both truncating and padding. I can see that when the dataset is returned from the tokenizer, the input ids are all the same length. However, when I check the input id lengths after they are loaded into the DataLoader, the lengths are variable. If I remove the collator and batch size arguments, everything works fine with the same code. I assume I’m doing something stupid with the data collator? But I’ve tried a couple of collators, datasets, models, and tokenizers and I have the same issue. Any thoughts?

from transformers import (
    DataCollatorWithPadding,
    DataCollatorForTokenClassification,
    AutoTokenizer,
    AutoModelForTokenClassification,
)

from datasets import load_dataset

from torch.utils.data import DataLoader

#Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# tokenizing function
def tokenize(data):
    # tokenize the pre-split words, truncating and padding so the input ids share a length
    tokenized_samples = tokenizer(
        data["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding=True,
    )
    return tokenized_samples

#load wikiann dataset
dataset = load_dataset("wikiann", "bn")["train"]

#tokenize dataset with padding and truncation
dataset_tokenized = dataset.map(tokenize, batched=True)

#remove extra columns
dataset_tokenized = dataset_tokenized.remove_columns(["langs", "spans", "tokens"])

#change tag columns to labels
dataset_tokenized = dataset_tokenized.rename_column("ner_tags", "labels")

#instantiate collator - note also tried this with DataCollatorWithPadding
collator = DataCollatorForTokenClassification(tokenizer)

#Instantiate PyTorch DataLoader
dl = DataLoader(dataset_tokenized, shuffle=True, collate_fn=collator, batch_size=2)
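
For reference, a quick way to see whether padding actually produced one consistent length (a rough sketch, not part of the original script):

# count the distinct input id lengths after tokenization;
# more than one value means examples were padded to different lengths
lengths = {len(ids) for ids in dataset_tokenized["input_ids"]}
print(lengths)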

I did figure out the issue. It turns out that setting padding=True is equivalent to padding='longest', which pads to the longest sequence in the batch the tokenizer sees, not to a length that is consistent across the whole dataset. Because dataset.map(batched=True) feeds the tokenizer one chunk of examples at a time, each chunk gets padded to its own longest sequence, hence the error when the collator tries to batch examples of different lengths. I’m not sure if I’m missing something, but it didn’t seem like I could control that batch size or pad to the longest sequence across the whole dataset. So instead I used padding='max_length', which pads every example to the same length across the dataset.
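
For anyone hitting the same thing, here is a minimal sketch of the revised tokenizing function; the max_length=512 value is just an assumption based on BERT’s input limit, not something fixed by the original code:

def tokenize(data):
    # pad every example to the same fixed length so batches are consistent across the dataset
    tokenized_samples = tokenizer(
        data["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=512,  # assumed value; BERT accepts at most 512 tokens
    )
    return tokenized_samples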
