Hi!
TL;DR: How do I process (resize + rescale) a Hugging Face dataset of ~16,000 PIL images into NumPy arrays or TensorFlow tensors and convert it to a TensorFlow dataset? And how can I optimize this in terms of runtime and disk space?
I've been discovering Hugging Face recently.
I've uploaded my first dataset, consisting of 16,500 images: corentinm7/MyoQuant-SDH-Data · Datasets at Hugging Face
I'm trying to import them in a Jupyter Notebook to train a model with Keras/TensorFlow.
I first need to process them in two ways: (i) resize them with tf.image.resize(image, (256, 256)), and (ii) rescale them to [-1, +1] with tensorflow.keras.applications.resnet_v2.preprocess_input(x):
import tensorflow as tf
from tensorflow.keras.utils import load_img, img_to_array
from tensorflow.keras.applications.resnet_v2 import preprocess_input
from datasets import load_dataset
ds = load_dataset("corentinm7/MyoQuant-SDH-Data")
def transforms(examples):
    examples["pixel_values"] = []
    for image in examples["image"]:
        _img = img_to_array(image)
        _im_resized = tf.image.resize(_img, (256, 256))
        examples["pixel_values"].append(preprocess_input(_im_resized))
    print("Finished processing!")
    return examples
ds = ds.map(transforms, remove_columns=["image"], batched=True)
tf_ds_train = ds["train"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True)
tf_ds_val = ds["validation"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True)
tf_ds_test = ds["test"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True)
The issue I have is that ds.map() takes a very long time to run and then even crashes. The print("Finished processing!") fires very quickly (meaning the processing itself is fast), but it then takes multiple minutes to move on to the next batch:
0%|          | 0/13 [00:00<?, ?ba/s]Finished processing!
8%|▊         | 1/13 [03:23<40:46, 203.86s/ba]Finished processing!
15%|█▌        | 2/13 [06:46<37:15, 203.18s/ba]Finished processing!
15%|█▌        | 2/13 [10:16<56:29, 308.15s/ba]
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
File /data/corentin-code-project/MyoQuant-SDH-ResNet/.venv/lib/python3.10/site-packages/datasets/arrow_dataset.py:2985, in Dataset._map_single(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
2984 else:
-> 2985 writer.write_batch(batch)
2986 if update_data and writer is not None:
OSError: [Errno 28] No space left on device
It took over three minutes per batch and then it crashed due to "lack of space", but I don't understand how this could happen when the dataset is only about 1 gigabyte and I have several dozen gigabytes of space available.
For the size issue, it looks like my 50-gigabyte data volume is full because this single dataset created a Hugging Face cache folder of 33 gigabytes! I don't understand how this can happen.
(.venv) (base) meyer@guepe:/data/corentin-cache/huggingface/datasets/corentinm7___myo_quant-sdh-data/SDH_16k/1.0.0/21b584239a638aeeda33cba1ac2ca4869d48e4b4f20fb22274d5a5ddc487659d$ ll -h
total 33G
drwxr-xr-x 2 meyer bio3d 4.0K Nov 5 21:18 ./
drwxr-xr-x 3 meyer bio3d 4.0K Nov 4 13:23 ../
-rw-r--r-- 1 meyer bio3d 8 Nov 4 13:23 LICENSE
-rw-r--r-- 1 meyer bio3d 786M Nov 4 17:56 cache-1ad7947b40ff2d38.arrow
-rw-r--r-- 1 meyer bio3d 7.3G Nov 5 12:59 cache-5f77551c894d805b.arrow
-rw-r--r-- 1 meyer bio3d 219M Nov 4 17:59 cache-62537c501b577b0c.arrow
-rw-r--r-- 1 meyer bio3d 88M Nov 4 23:29 cache-67d6a9f377e1ce60.arrow
-rw-r--r-- 1 meyer bio3d 219M Nov 4 13:44 cache-72a6dcd3cb4c3203.arrow
-rw-r--r-- 1 meyer bio3d 12G Nov 4 21:24 cache-76a99af51fbbb546.arrow
-rw-r--r-- 1 meyer bio3d 88M Nov 4 13:42 cache-aa2ea2c1422d4588.arrow
-rw-r--r-- 1 meyer bio3d 2.1G Nov 5 13:00 cache-b433c1f76ccec680.arrow
-rw-r--r-- 1 meyer bio3d 831M Nov 5 12:59 cache-c92b8a028fd3589e.arrow
-rw-r--r-- 1 meyer bio3d 1.4G Nov 4 21:24 cache-c982bc702ca6a33c.arrow
-rw-r--r-- 1 meyer bio3d 88M Nov 4 17:56 cache-ca19065bbaecae44.arrow
-rw-r--r-- 1 meyer bio3d 219M Nov 4 23:32 cache-d2186c63001436d6.arrow
-rw-r--r-- 1 meyer bio3d 3.3G Nov 4 21:25 cache-d776a4a6f73d425b.arrow
-rw-r--r-- 1 meyer bio3d 786M Nov 4 23:29 cache-dec757b4471093dc.arrow
-rw-r--r-- 1 meyer bio3d 786M Nov 4 13:42 cache-e890b897f44f7c7a.arrow
-rw-r--r-- 1 meyer bio3d 1.9K Nov 4 13:23 dataset_info.json
-rw-r--r-- 1 meyer bio3d 534K Nov 4 13:23 myo_quant-sdh-data-test.arrow
-rw-r--r-- 1 meyer bio3d 1.9M Nov 4 13:23 myo_quant-sdh-data-train.arrow
-rw-r--r-- 1 meyer bio3d 222K Nov 4 13:23 myo_quant-sdh-data-validation.arrow
-rw------- 1 meyer bio3d 3.0G Nov 5 21:28 tmpyp8oimey
Could anyone help me figure out how to optimize this task? (I have a dataset of PIL objects that need to be resized and rescaled into NumPy arrays or tensors compatible with TensorFlow/Keras, optimized in terms of runtime and disk space.)
Thanks a lot!
EDIT:
Switching the processing to non-batched mode seems to be faster, but it also gets stuck for a while after processing each "batch".
def transforms2(examples):
    _img = img_to_array(examples["image"])
    _im_resized = tf.image.resize(_img, (256, 256))
    examples["pixel_values"] = preprocess_input(_im_resized)
    return examples

ds = ds.map(transforms2, remove_columns=["image"])
16%|█▋        | 1979/12085 [03:46<00:26, 378.73ex/s]
The first 8% took about 10 seconds and then it hung there for 5 minutes.
Then it jumped to 16% and hung there for another 5 minutes…
Eventually it takes an hour to run, and then it's the tf_ds_train = ds["train"].to_tf_dataset(columns=["pixel_values"], label_cols=["label"], batch_size=BATCH_SIZE, shuffle=True) call that gets stuck running forever.