Hi,
I have two split datasets of the same type: tweets with labels for sequence classification. I create both the exact same way, from pandas DataFrames. They have the same columns, texts and labels, before the dataset conversion, and afterwards labels, input_ids, and attention_mask.
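For context, both are converted roughly like this (a minimal sketch; the DataFrame contents and variable names are placeholders):

```python
import pandas as pd
from datasets import Dataset

# Both DataFrames have the same two columns: "texts" and "labels".
df_a = pd.DataFrame({"texts": ["some tweet", "another tweet"], "labels": [0, 1]})
df_b = pd.DataFrame({"texts": ["a third tweet", "one more tweet"], "labels": [1, 0]})

# Identical conversion for both.
dataset_a = Dataset.from_pandas(df_a)
dataset_b = Dataset.from_pandas(df_b)
```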
For one of them I can call Trainer.train() without problems, but for the other I get this error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
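The training call itself is roughly this (a sketch: the checkpoint name and arguments are placeholders, and tokenized_a / tokenized_b are the outputs of the tokenize function shown further down):

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",  # placeholder checkpoint
    num_labels=2,
)
args = TrainingArguments(output_dir="out")

# Works fine with tokenized_a; raises the ValueError above with tokenized_b.
trainer = Trainer(model=model, args=args, train_dataset=tokenized_b)
trainer.train()
```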
At first I thought it might be due to longer texts in the dataset where the error occurs, but the other one actually has a longer maximum sequence, and the mean length is about the same, roughly 115 characters. The minimum length is exactly the same: 13 characters. There are no None or NaN values.
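These are the checks I ran, more or less (a sketch using the DataFrames from above):

```python
# Character-length statistics per dataset.
for name, df in [("a", df_a), ("b", df_b)]:
    lengths = df["texts"].str.len()
    print(name, lengths.min(), lengths.mean(), lengths.max())

# Missing values in either column.
print(df_a.isna().sum())
print(df_b.isna().sum())
```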
Can someone point me to what this error means?
This is the tokenize function I use, taken from the docs:

```python
def tokenize(batch):
    return tokenizer(batch["texts"], padding=True, truncation=True)
```
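I apply it to both datasets the same way, roughly like this (a sketch; the tokenizer checkpoint is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # placeholder

# batched=True so tokenize() receives a batch dict, as in the docs.
tokenized_a = dataset_a.map(tokenize, batched=True)
tokenized_b = dataset_b.map(tokenize, batched=True)
```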
edit 1: Hmm, could it be because of emojis like smiley faces being present?
edit 2: Hmm, no. With all emojis removed I still get the same error.
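(For reference, I stripped the emojis roughly like this; the regex is a sketch covering common emoji code-point blocks, not necessarily exhaustive:)

```python
import re

# Rough emoji stripper: common emoji / symbol / flag code-point ranges.
emoji_pattern = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

df_b["texts"] = df_b["texts"].str.replace(emoji_pattern, "", regex=True)
```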