I'm getting the error in the topic title when I try to run the datasets .map()
function.
I'm preprocessing ~3,000 large PDF documents that have been converted into pages of images, words, and bboxes. .map()
is being called as follows:
dataset = Dataset.from_pandas(data)
features = Features({
    'image': Array3D(dtype="int64", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'label': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
})
encoded_dataset = dataset.map(preprocess_data, features=features)
and the preprocess_data
function looks like:
def preprocess_data(examples):
    directory = os.path.join(FILES_PATH, examples['file_location'])
    images_dir = os.path.join(directory, PDF_IMAGE_DIR)
    textract_response_path = os.path.join(directory, 'textract_response.json')
    doc_meta_path = os.path.join(directory, 'pdf_meta.json')
    textract_document = get_textract_document(textract_response_path, doc_meta_path)
    images, words, bboxes = get_doc_training_data(images_dir, textract_document)
    encoded_inputs = PROCESSOR(images, words, boxes=bboxes, padding="max_length", truncation=True)
    # https://github.com/NielsRogge/Transformers-Tutorials/issues/36
    encoded_inputs["image"] = np.array(encoded_inputs["image"])
    encoded_inputs["label"] = examples['label_id']
    return encoded_inputs
This setup worked fine for the first 998/2933 instances; on instance 999 I first got this error:
ArrowInvalid: Can only convert 1-dimensional array values
which I fixed by adding:
encoded_inputs["image"] = np.array(encoded_inputs["image"])
As specified in ArrowInvalid: Can only convert 1-dimensional array values · Issue #36 · NielsRogge/Transformers-Tutorials · GitHub
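For context, here is a minimal illustration (with a hypothetical toy batch, not my real data) of why the np.array conversion helps: Arrow can't infer a fixed multi-dimensional shape from plain nested Python lists, but a NumPy array carries an explicit shape that matches the Array3D feature.

```python
import numpy as np

# A toy "batch" of 2 images as nested Python lists, the kind of structure
# the processor returns before the np.array conversion.
images_as_lists = [[[[0] * 224] * 224] * 3 for _ in range(2)]

# Converting to a NumPy array gives Arrow a fixed (batch, 3, 224, 224) shape.
images_as_array = np.array(images_as_lists, dtype=np.int64)
print(images_as_array.shape)  # (2, 3, 224, 224)
```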
However, on the exact same instance (999) I now get:
OverflowError: There was an overflow with type <class 'list'>. Try to reduce writer_batch_size to have batches smaller than 2GB.
(offset overflow while concatenating arrays)
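If it's relevant, a quick back-of-the-envelope check (assuming .map()'s default writer_batch_size of 1000 and the int64 image feature above) suggests the image column alone gets close to the 2 GB limit the error mentions:

```python
# Rough size of one default writer batch (writer_batch_size=1000) for the
# "image" column alone, with int64 pixels as in the Array3D feature above.
bytes_per_image = 3 * 224 * 224 * 8      # int64 = 8 bytes per element
batch_bytes = 1000 * bytes_per_image
print(f"{batch_bytes / 2**30:.2f} GiB")  # ~1.12 GiB before the other columns
```

So, as the error message itself suggests, passing something like dataset.map(preprocess_data, features=features, writer_batch_size=100) should keep each Arrow write well under the limit (100 is just an illustrative value).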
Any idea what is going on here?