I am tokenizing my dataset with a custom tokenize_function that tokenizes two different texts and then appends them together. This is the code:
from datasets import load_dataset

# Load the datasets
data_files = {
    "train": "train_pair.csv",
    "test": "test_pair.csv",
    "val": "val_pair.csv"
}
datasets = load_dataset('csv', data_files=data_files)

# tokenize the dataset (the tokenizer object itself is created earlier in my code)
def tokenize_function(batch):
    # Get the maximum length from the model configuration
    max_length = 512
    # Tokenize each text separately and truncate to half the maximum length
    tokenized_text1 = tokenizer(batch['text1'], truncation=True, max_length=int(max_length/2), add_special_tokens=True)
    tokenized_text2 = tokenizer(batch['text2'], truncation=True, max_length=int(max_length/2), add_special_tokens=True)
    # Merge the results
    tokenized_inputs = {
        'input_ids': tokenized_text1['input_ids'] + tokenized_text2['input_ids'][1:],  # exclude the [CLS] token from the second sequence
        'attention_mask': tokenized_text1['attention_mask'] + tokenized_text2['attention_mask'][1:]
    }
    return tokenized_inputs

# Tokenize the datasets
tokenized_datasets = datasets.map(tokenize_function, batched=True)
This code generates the following error:
ArrowInvalid: Column 3 named input_ids expected length 1000 but got length 1999
The error is misleading: it suggests that the length of input_ids is 1999, while it is impossible for the length of this column to ever exceed 512 for any example.
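For context, this is the kind of per-example check I have in mind (a rough sketch, not my exact code; it assumes the tokenizer is already loaded):

# run the tokenize function on a single row
sample = datasets["train"][0]                    # batch['text1'] is then a plain string
ids = tokenize_function(sample)['input_ids']
print(len(ids))                                  # at most 256 + 255 = 511, so never above 512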
If I set batched=False, there is no error.
I also tried different batch sizes, such as 8 and 25 (because the number of samples is divisible by 25), but it did not work.
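Concretely, these are roughly the calls I tried (a sketch; only the batched=False version completes without the error):

# works, but maps one example at a time
tokenized_datasets = datasets.map(tokenize_function, batched=False)
# these still fail with the same ArrowInvalid error
tokenized_datasets = datasets.map(tokenize_function, batched=True, batch_size=8)
tokenized_datasets = datasets.map(tokenize_function, batched=True, batch_size=25)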