I have a question about the performance difference between Image and Array2D.
In a real project, I ran into a processing-speed problem with small 2D grayscale images and with larger 4K three-channel RGB images. Surprisingly, reading the images after converting them to Array2D was slower than reading them stored as Image. This was unexpected, since I had assumed Array2D would be faster than Image. I simulated the problem with NumPy arrays, as shown below:
When reading 1000 small images (2000*600) with Hugging Face’s Array2D and Image, the reading speed differs significantly depending on the complexity of the image content.
Steps to Reproduce:
- Generate a list of random small images using the following code:
import numpy as np
from datasets import Dataset, Features, Image, load_from_disk, Array2D
from torch.utils.data import DataLoader
import time
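count = 1000
img_list = [np.random.randint(0, 256, (2000, 600), dtype=np.uint8) for _ in range(count)]
img_dict = {"img": img_list}  # column name matches the "img" feature used below
ds_path = "./img_ds"  # placeholder path; any writable directory works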
- Repeat this for the different initialization variants (random values from 0 to 255, all values fixed at 255, or all values fixed at 0).
- Set up the dataset:
Image:
features = Features({"img": Image()})
ds = Dataset.from_dict(img_dict, features=features)
ds.save_to_disk(ds_path, num_shards=3)
ds = load_from_disk(ds_path)
ds = ds.with_format("np")
Array2D:
features = Features({"img": Array2D(dtype="uint8", shape=(2000, 600))})
ds = Dataset.from_dict(img_dict, features=features)
ds.save_to_disk(ds_path, num_shards=3)
ds = load_from_disk(ds_path)
ds = ds.with_format("np")
- Set the DataLoader batch size to 1.
- Measure the reading speed with Array2D and with Image:
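dl = DataLoader(ds, batch_size=1)
start_time = time.time()
# Iterate over the whole dataset once; each batch is only read, not processed.
for data in dl:
    a = data
# This is the total elapsed time for all batches, not a per-batch average.
print(f"total time {time.time() - start_time:.1f}s")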
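The loop above reports the total elapsed time for all 1000 batches. To compare the two features per image, a small helper like the following could be used. This is only a sketch: `time_iteration` is a hypothetical name, and `range(1000)` stands in for the DataLoader so the snippet runs on its own.

```python
import time
from typing import Iterable, Tuple


def time_iteration(iterable: Iterable) -> Tuple[float, float, int]:
    """Iterate once; return (total seconds, average seconds per item, item count)."""
    count = 0
    start = time.perf_counter()
    for _ in iterable:
        count += 1
    total = time.perf_counter() - start
    # Guard against an empty iterable to avoid dividing by zero.
    average = total / count if count else 0.0
    return total, average, count


# Stand-in iterable; in the benchmark this would be the DataLoader `dl`.
total, average, count = time_iteration(range(1000))
print(f"total {total:.3f}s, avg {average * 1e3:.3f}ms over {count} items")
```

Passing `dl` instead of `range(1000)` would give the total and per-batch time for one pass over the dataset.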
Observed Behavior:
The reading speed differs significantly between Array2D and Image, and the gap depends on how the images were initialized.
- For random initialization (np.random.randint(0, 256, ...)):
  - Array2D speed: 4.6s
  - Image speed: 16s
- For fixed values at 255 (np.random.randint(255, 256, ...)):
  - Array2D speed: 5.3s
  - Image speed: 4.5s
- For fixed values at 0 (np.random.randint(0, 1, ...)):
  - Array2D speed: 5.2s
  - Image speed: 3.6s
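One possible explanation I have not confirmed: if Image stores each array as compressed image bytes such as PNG (an assumption here), then decoding cost would depend on how compressible the pixels are, while Array2D would not be affected. PNG uses DEFLATE compression, so the standard-library `zlib` module can serve as a rough stand-in. The sketch below only compares compressed sizes and decompression times for random and constant buffers of the same size as the test images; it does not call `datasets` at all, and `deflate_roundtrip` is a hypothetical helper name.

```python
import os
import time
import zlib

HEIGHT, WIDTH = 2000, 600  # same shape as the test images


def deflate_roundtrip(raw: bytes):
    """Return (compressed size in bytes, decompression time in seconds)."""
    compressed = zlib.compress(raw)
    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    elapsed = time.perf_counter() - start
    assert restored == raw  # sanity check: the round trip is lossless
    return len(compressed), elapsed


# Random bytes stand in for np.random.randint(0, 256, ...).
random_pixels = os.urandom(HEIGHT * WIDTH)
# A constant buffer stands in for the all-255 images.
constant_pixels = bytes([255]) * (HEIGHT * WIDTH)

random_size, random_time = deflate_roundtrip(random_pixels)
constant_size, constant_time = deflate_roundtrip(constant_pixels)

print(f"random:   {random_size} bytes, {random_time * 1e3:.2f} ms to decompress")
print(f"constant: {constant_size} bytes, {constant_time * 1e3:.2f} ms to decompress")
```

If the constant buffer turns out far smaller and faster to decompress, that would be consistent with the Image timings above, although it would not prove that this is where the time is spent.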
Expected behavior
According to the issue "Abusurdly slow on iteration" (Issue #5841, huggingface/datasets on GitHub), Array2D should iterate roughly 2x faster than Image.
Environment info
datasets==2.14.6
pyarrow==13.0.0
pytorch==1.12.1
Python version: 3.8.15