I am running dataset.map() on a dataset of 160k items, and it stops at around 25.1k with a memory allocation error. Is there a workaround that doesn't require getting more RAM? I'm wondering if there's a way to save_to_disk() every 10k items or so. Is that possible? (I sketched roughly what I have in mind below my current code.) Here's my preprocessing code:
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D
from PIL import Image
import numpy as np

# we need to define custom features
features = Features({
    'image': Array3D(dtype="int64", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': ClassLabel(num_classes=len(labels), names=labels),
})

def preprocess_data(examples):
    # open a batch of images
    images = [Image.open("images/" + path).convert("RGB") for path in examples['image_path']]
    # processor is a LayoutLMv2Processor (initialized elsewhere); it runs OCR and tokenizes
    encoded_inputs = processor(images, padding="max_length", truncation=True)
    encoded_inputs["image"] = np.array(encoded_inputs["image"])
    # add labels
    encoded_inputs["labels"] = [label for label in examples["label"]]
    return encoded_inputs

encoded_dataset = dataset.map(preprocess_data, remove_columns=dataset.column_names, features=features,
                              batched=True, batch_size=2)
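
Roughly, this is what I have in mind (just a sketch, not something I've gotten working). The chunk size, the output directory, and the reload/concatenate step at the end are placeholders I made up:

import math
from datasets import load_from_disk, concatenate_datasets

chunk_size = 10_000  # placeholder chunk size
num_chunks = math.ceil(len(dataset) / chunk_size)

# process and save one chunk at a time instead of the whole 160k at once
for i in range(num_chunks):
    start = i * chunk_size
    end = min(start + chunk_size, len(dataset))
    chunk = dataset.select(range(start, end))
    encoded_chunk = chunk.map(preprocess_data, remove_columns=dataset.column_names,
                              features=features, batched=True, batch_size=2)
    encoded_chunk.save_to_disk(f"encoded_chunks/chunk_{i}")  # placeholder path

# later, reload the saved pieces and stitch them back together
encoded_dataset = concatenate_datasets(
    [load_from_disk(f"encoded_chunks/chunk_{i}") for i in range(num_chunks)]
)

Would something like this keep memory bounded, or is there a better built-in way to do it?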