I’m having some strange issues with the data collator and DataLoader. I get the following ValueError:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. What is strange is that, as you can see from the code below, I am already both truncating and padding. When the dataset comes back from the tokenizer, the input IDs are all the same length. However, when I check the input ID lengths after they are loaded into the DataLoader, the lengths are variable. If I remove the collator and batch-size arguments, everything works fine with the same code. I assume I’m doing something stupid with the data collator? But I’ve tried a couple of collators, datasets, models, and tokenizers and hit the same issue. Any thoughts?
```python
from transformers import (
    DataCollatorWithPadding,
    DataCollatorForTokenClassification,
    AutoTokenizer,
    AutoModelForTokenClassification,
)
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Tokenizing function
def tokenize(data):
    tokenized_samples = tokenizer(
        data["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding=True,
    )
    return tokenized_samples

# Load the WikiANN dataset
dataset = load_dataset("wikiann", "bn")["train"]

# Tokenize the dataset with padding and truncation
dataset_tokenized = dataset.map(tokenize, batched=True)

# Remove extra columns
dataset_tokenized = dataset_tokenized.remove_columns(["langs", "spans", "tokens"])

# Rename the tag column to "labels"
dataset_tokenized = dataset_tokenized.rename_column("ner_tags", "labels")

# Instantiate collator (note: also tried this with DataCollatorWithPadding)
collator = DataCollatorForTokenClassification(tokenizer)

# Instantiate PyTorch DataLoader
dl = DataLoader(dataset_tokenized, shuffle=True, collate_fn=collator, batch_size=2)
```
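For what it’s worth, one plausible reading of the observation above (uniform lengths out of the tokenizer, variable lengths out of the DataLoader) is that `padding=True` inside a batched `map()` pads each `map()` chunk to that chunk’s own longest sequence, not to one global length. Here is a minimal pure-Python sketch of that behavior (no model or dataset downloads; the sequences and the `pad_chunk` helper are purely illustrative, not part of the original code):

```python
def pad_chunk(chunk, pad_id=0):
    """Pad every sequence in one map() chunk to the chunk's longest length."""
    longest = max(len(seq) for seq in chunk)
    return [seq + [pad_id] * (longest - len(seq)) for seq in chunk]

# Two hypothetical map() chunks with different maximum lengths.
chunk_a = [[1, 2, 3], [4, 5]]        # longest = 3
chunk_b = [[6, 7, 8, 9, 10], [11]]   # longest = 5

padded = pad_chunk(chunk_a) + pad_chunk(chunk_b)

# Within each chunk the lengths are uniform...
lengths = [len(seq) for seq in padded]
print(lengths)  # [3, 3, 5, 5]

# ...but a shuffled DataLoader batch can mix sequences from different
# chunks, so the collator receives ragged inputs across the dataset.
```

If that reading is right, it would match seeing same-length input IDs right after tokenization but mixed lengths once `shuffle=True` draws samples from different chunks.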