I have a dataset object created from a pandas DataFrame (no padding at this point, because I want to pad later inside the collator):
df = pd.read_csv(file, dtype=object, header=None)
dataset_train = datasets.Dataset.from_pandas(df, preserve_index=False)
dataset_train = dataset_train.map(
    lambda examples: tokenizer(examples['feature1'], padding='longest', max_length=2048),
    batched=True, batch_size=32)
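For completeness, this is how I sanity-check the tokenized dataset (just an inspection snippet; 'input_ids' is one of the columns added by the tokenizer output):
# Sanity check: list the dataset columns and a few tokenized sequence lengths
print(dataset_train.column_names)
print([len(ids) for ids in dataset_train[:3]['input_ids']])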
I then use the following data collator for masked language modeling:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2, return_tensors='pt')
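The Trainer setup is roughly the following (the model checkpoint, output directory, and epoch count below are placeholders rather than my exact values; the relevant part is that the tokenized dataset and the collator above are passed in):
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

# Placeholder model and arguments; dataset_train and data_collator are defined above
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
training_args = TrainingArguments(output_dir='mlm_out',
                                  per_device_train_batch_size=32,
                                  num_train_epochs=1)
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=dataset_train,
                  data_collator=data_collator)
trainer.train()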
I am getting the following error while running the Trainer routine:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
But I am already padding the input batch-wise and using the same batch size during training. What am I doing wrong here?