I am tokenizing my dataset with a custom tokenize_function that tokenizes two different texts and then appends them together. This is the code:
from datasets import load_dataset

# Load the datasets
data_files = {
    "train": "train_pair.csv",
    "test": "test_pair.csv",
    "val": "val_pair.csv"
}
datasets = load_dataset('csv', data_files=data_files)

# tokenize the dataset (the tokenizer object itself is created earlier in my code)
def tokenize_function(batch):
    # Get the maximum length from the model configuration
    max_length = 512
    # Tokenize each text separately and truncate to half the maximum length
    tokenized_text1 = tokenizer(batch['text1'], truncation=True, max_length=int(max_length/2), add_special_tokens=True)
    tokenized_text2 = tokenizer(batch['text2'], truncation=True, max_length=int(max_length/2), add_special_tokens=True)
    # Merge the results
    tokenized_inputs = {
        'input_ids': tokenized_text1['input_ids'] + tokenized_text2['input_ids'][1:],  # exclude the [CLS] token from the second sequence
        'attention_mask': tokenized_text1['attention_mask'] + tokenized_text2['attention_mask'][1:]
    }
    return tokenized_inputs

# Tokenize the datasets
tokenized_datasets = datasets.map(tokenize_function, batched=True)
This code generates the following error:
ArrowInvalid: Column 3 named input_ids expected length 1000 but got length 1999
The error is misleading: it suggests that the length of input_ids is 1999, while it is impossible for the length of this column to ever exceed 512 for any example.
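For context, this is the kind of per-example check I have in mind (a rough sketch, not my exact code; it assumes the tokenizer is already loaded):

# run the tokenize function on a single row
sample = datasets["train"][0]                    # batch['text1'] is then a plain string
ids = tokenize_function(sample)['input_ids']
print(len(ids))                                  # at most 256 + 255 = 511, so never above 512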
If I set batched=False, there is no error.
I also tried different batch sizes, such as 8 and 25 (because the number of samples is divisible by 25), but it did not work.
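Concretely, these are roughly the calls I tried (a sketch; only the batched=False version completes without the error):

# works, but maps one example at a time
tokenized_datasets = datasets.map(tokenize_function, batched=False)
# these still fail with the same ArrowInvalid error
tokenized_datasets = datasets.map(tokenize_function, batched=True, batch_size=8)
tokenized_datasets = datasets.map(tokenize_function, batched=True, batch_size=25)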