Ds.map(): optimizing PIL Image processing as tensorflow tensor

Hi !

TL;DR: How can I process (resize + rescale) a Hugging Face dataset of 16.000 PIL images into NumPy arrays or TensorFlow tensors and convert it to a TensorFlow dataset? How can I optimize this in terms of runtime and disk space?

I’ve been discovering HuggingFace recently.
I’ve uploaded my first dataset, consisting of 16.500 images: corentinm7/MyoQuant-SDH-Data (Datasets at Hugging Face).
I’m trying to import them in a Jupyter Notebook to train a model with Keras/TensorFlow.
First I need to process them in two ways: (i) resize them using tf.image.resize(image, (256,256)), and (ii) rescale them to [-1, +1] using tensorflow.keras.applications.resnet_v2.preprocess_input(x).

import tensorflow as tf
from tensorflow.keras.utils import load_img, img_to_array
from tensorflow.keras.applications.resnet_v2 import preprocess_input
from datasets import load_dataset

BATCH_SIZE = 32  # placeholder value; the actual batch size is not given in the post

ds = load_dataset("corentinm7/MyoQuant-SDH-Data")

def transforms(examples):
    examples["pixel_values"] = []
    for image in examples["image"]:
        _img = img_to_array(image)
        _im_resized = tf.image.resize(_img, (256,256))
        examples["pixel_values"].append(preprocess_input(_im_resized))
    print("Finshed processing !")
    return examples

ds = ds.map(transforms, remove_columns=["image"], batched=True)

tf_ds_train = ds["train"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True)
tf_ds_val = ds["validation"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True)
tf_ds_test = ds["test"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True)

The issue I have is that ds.map() takes a very long time to run and then even crashes. The print("Finshed processing !") output appears really fast (meaning that the processing itself is fast), but then it takes multiple minutes to move on to the next batch.

  0%|          | 0/13 [00:00<?, ?ba/s]Finshed processing !
  8%|▊         | 1/13 [03:23<40:46, 203.86s/ba]Finshed processing !
 15%|█▌        | 2/13 [06:46<37:15, 203.18s/ba]Finshed processing !
 15%|█▌        | 2/13 [10:16<56:29, 308.15s/ba]

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File /data/corentin-code-project/MyoQuant-SDH-ResNet/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py:2985, in Dataset._map_single(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
   2984             else:
-> 2985                 writer.write_batch(batch)
   2986 if update_data and writer is not None:

OSError: [Errno 28] No space left on device

Each batch took more than 3 minutes, and after about 10 minutes it crashed due to “lack of space”, but I don’t understand how this could happen when the dataset is only about 1 gigabyte and I have multiple dozens of gigabytes of space available.

For the size issue, it looks like my 50-gigabyte data volume is full because this single dataset created a Hugging Face cache folder of 33 gigabytes! I don’t understand how this can happen.

(.venv) (base) meyer@guepe:/data/corentin-cache/huggingface/datasets/corentinm7___myo_quant-sdh-data/SDH_16k/1.0.0/21b584239a638aeeda33cba1ac2ca4869d48e4b4f20fb22274d5a5ddc487659d$ ll -h
total 33G
drwxr-xr-x 2 meyer bio3d 4.0K Nov  5 21:18 ./
drwxr-xr-x 3 meyer bio3d 4.0K Nov  4 13:23 ../
-rw-r--r-- 1 meyer bio3d    8 Nov  4 13:23 LICENSE
-rw-r--r-- 1 meyer bio3d 786M Nov  4 17:56 cache-1ad7947b40ff2d38.arrow
-rw-r--r-- 1 meyer bio3d 7.3G Nov  5 12:59 cache-5f77551c894d805b.arrow
-rw-r--r-- 1 meyer bio3d 219M Nov  4 17:59 cache-62537c501b577b0c.arrow
-rw-r--r-- 1 meyer bio3d  88M Nov  4 23:29 cache-67d6a9f377e1ce60.arrow
-rw-r--r-- 1 meyer bio3d 219M Nov  4 13:44 cache-72a6dcd3cb4c3203.arrow
-rw-r--r-- 1 meyer bio3d  12G Nov  4 21:24 cache-76a99af51fbbb546.arrow
-rw-r--r-- 1 meyer bio3d  88M Nov  4 13:42 cache-aa2ea2c1422d4588.arrow
-rw-r--r-- 1 meyer bio3d 2.1G Nov  5 13:00 cache-b433c1f76ccec680.arrow
-rw-r--r-- 1 meyer bio3d 831M Nov  5 12:59 cache-c92b8a028fd3589e.arrow
-rw-r--r-- 1 meyer bio3d 1.4G Nov  4 21:24 cache-c982bc702ca6a33c.arrow
-rw-r--r-- 1 meyer bio3d  88M Nov  4 17:56 cache-ca19065bbaecae44.arrow
-rw-r--r-- 1 meyer bio3d 219M Nov  4 23:32 cache-d2186c63001436d6.arrow
-rw-r--r-- 1 meyer bio3d 3.3G Nov  4 21:25 cache-d776a4a6f73d425b.arrow
-rw-r--r-- 1 meyer bio3d 786M Nov  4 23:29 cache-dec757b4471093dc.arrow
-rw-r--r-- 1 meyer bio3d 786M Nov  4 13:42 cache-e890b897f44f7c7a.arrow
-rw-r--r-- 1 meyer bio3d 1.9K Nov  4 13:23 dataset_info.json
-rw-r--r-- 1 meyer bio3d 534K Nov  4 13:23 myo_quant-sdh-data-test.arrow
-rw-r--r-- 1 meyer bio3d 1.9M Nov  4 13:23 myo_quant-sdh-data-train.arrow
-rw-r--r-- 1 meyer bio3d 222K Nov  4 13:23 myo_quant-sdh-data-validation.arrow
-rw------- 1 meyer bio3d 3.0G Nov  5 21:28 tmpyp8oimey

Could anyone help me figure out how to optimize this task? (I have a dataset of PIL objects that need to be resized and rescaled into NumPy arrays or tensors compatible with TensorFlow/Keras, and I want to optimize it in terms of runtime and disk space.)

Thanks a lot !

EDIT:
Modifying the processing to be non-batched seems to be faster, but it also gets stuck for a while after processing a “batch”.

def transforms2(examples):
    _img = img_to_array(examples["image"])
    _im_resized = tf.image.resize(_img, (256,256))
    examples["pixel_values"] = preprocess_input(_im_resized)
    return examples
ds = ds.map(transforms2, remove_columns=["image"])

16%|█▋ | 1979/12085 [03:46<00:26, 378.73ex/s]
The first 8% took like 10 seconds and then it hangs here (for 5 minutes).
Then it spikes to 16% and hangs there for 5 minutes…

Eventually it takes 1 hour to run, and then it’s the tf_ds_train = ds["train"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True) call that is stuck running forever.

Hi !

Re: Arrow file size:

As you may know, storing arrays of pixel values takes much more space on disk than encoded images. Moreover if you don’t specify that all your arrays have the same size in advance, then all the offsets of each pixel in the arrays are stored - in case one sequence has more items than the others.
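To make the size difference concrete, here is a rough back-of-the-envelope estimate. It assumes 3 colour channels and float32 values, neither of which is stated in the thread, and it ignores the extra offsets stored when the array shape is not declared, so the real files are likely larger.

```python
# Rough size of the decoded training split once stored as raw pixel arrays.
# Assumptions (not stated in the thread): 3 colour channels, float32 values.
num_images = 12085                      # training examples, from the progress bar above
height, width, channels = 256, 256, 3   # target size after tf.image.resize
bytes_per_value = 4                     # float32

bytes_per_image = height * width * channels * bytes_per_value
total_bytes = num_images * bytes_per_image

print(f"{bytes_per_image / 1024:.0f} KiB per image")   # 768 KiB per image
print(f"{total_bytes / 1e9:.1f} GB for the split")     # 9.5 GB for the split
```

Under these assumptions a single pass over the training split already needs several gigabytes. Each re-run of ds.map() with a modified function normally writes a new cache file, which would be consistent with the many cache-*.arrow files in the listing above.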

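You can use the Array2D feature type to specify the shape of your images:

from datasets import Array2D, Features
# Note: the dtype and shape must match the arrays your function returns
# (for RGB arrays of shape (256, 256, 3), Array3D is the matching type),
# and `features` has to describe every column kept in the output, e.g. "label".
ds = ds.map(..., features=Features({"pixel_values": Array2D(dtype="uint8", shape=(256, 256))}))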

Though it’s often more efficient to convert the images to arrays and resize them on-the-fly during training in order to save disk space.

Re: conversion to TF dataset:

IIRC to_tf_dataset doesn’t support the Image type (cc @Rocketknight1 )
So I’d encourage you to use a string type containing the path to your images, or a binary type containing the image blob, before converting to a TF dataset.
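As a rough sketch of the binary-blob route: once each example is available as encoded image bytes, decoding, resizing and rescaling can happen inside the tf.data pipeline. The synthetic PNG blobs and the decode_and_preprocess name below are placeholders for illustration. With the real dataset, the bytes could probably be obtained by casting the column to Image(decode=False), but that step is an assumption and is not shown here.

```python
import tensorflow as tf
from tensorflow.keras.applications.resnet_v2 import preprocess_input

# Synthetic stand-in: four small RGB images encoded as PNG bytes, plus labels.
raw = tf.random.uniform((4, 64, 48, 3), maxval=256, dtype=tf.int32)
blobs = [tf.io.encode_png(tf.cast(img, tf.uint8)).numpy() for img in raw]
labels = [0, 1, 0, 1]

def decode_and_preprocess(blob, label):
    # Decode the encoded bytes, resize to the model input size, rescale to [-1, +1].
    img = tf.io.decode_image(blob, channels=3, expand_animations=False)
    img = tf.image.resize(img, (256, 256))
    return preprocess_input(img), label

tf_ds = (
    tf.data.Dataset.from_tensor_slices((blobs, labels))
    .map(decode_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)
)

x, y = next(iter(tf_ds))
print(x.shape, x.dtype)
```

This keeps only the compressed bytes in memory or on disk and lets tf.data parallelise the decoding, at the price of redoing the preprocessing each epoch.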