Best practices for loading image files

Hi everybody,
I’m looking for best practices for loading image files.
Here is what I’ve tried so far:

from datasets import Dataset, Features, Array3D, Image

ds = Dataset.from_dict({"path": [my_jpg]})

# Method 1: cast + transform
ds_1 = (
    ds.rename_column("path", "pixel_values")
    .cast_column("pixel_values", Image())
    .with_transform(transform_to_tensor)
)

# Method 2: map + map
features = Features({"pixel_values": Array3D(..., dtype="float64")})
ds_2 = (
    ds.map(read_images, batched=True, batch_size=64)
    .map(transform_to_numpy, features=features, batched=True, batch_size=64, num_proc=8)
    .with_format("pt")
)

# Method 3: map + transform
ds_3 = (
    ds.map(read_images, batched=True, batch_size=64)
    .with_transform(transform_to_tensor)
)

# Method 4: ONE MAP
ds_4 = (
    ds.map(read_images_and_process_to_numpy, batched=True, batch_size=64)
    .with_format("pt")
)
  • Method 1 is really nice, but since everything is lazily loaded, the training loop is terribly slow (time spent on disk + CPU data loading)
  • Methods 2 and 3: the read_images map takes about 1 hour for 8 GB of data
  • Method 4: kind of works, but maybe not the most elegant? All data augmentation is done during loading.

Hi,

Probably the easiest way to load images into a Dataset is by leveraging the new ImageFolder builder.
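For reference, a minimal sketch of that (the data_dir path is a hypothetical example):

from datasets import load_dataset

# "./my_images" is a placeholder folder; ImageFolder infers labels from subfolder names
ds = load_dataset("imagefolder", data_dir="./my_images", split="train")
ds[0]["image"]  # decoded into a PIL image lazily, when the row is accessed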

method 1 saves you some disk space, since it reads the image from the path you provided without copying it to your disk

method 2 decodes your images completely and saves them on disk. The Array3D type makes it very fast to read the images back, with zero-copy reads of the arrays from disk. So you get high throughput at the cost of disk space.

method 3 doesn’t give better performance than 1, yet it requires some processing, and the map call copies the images into a new dataset, which consumes a bit of disk space

method 4 is the same as 2, but since it doesn’t have any intermediate step, you save some disk space. Note that you didn’t use Array3D for this one, but you probably should, to get the best performance :wink:
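A rough sketch of method 4 with an Array3D feature (the shape and dtype below are assumptions, and read_images_and_process_to_numpy is your own helper):

from datasets import Features, Array3D

# hypothetical fixed shape (channels, height, width); adjust to your preprocessing
features = Features({"pixel_values": Array3D(shape=(3, 224, 224), dtype="float32")})

ds_4 = (
    ds.map(
        read_images_and_process_to_numpy,
        batched=True,
        batch_size=64,
        features=features,          # store decoded arrays with a fixed Array3D type
        remove_columns=["path"],    # drop the path column so the schema matches `features`
    )
    .with_format("pt")
)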

As mentioned above, feel free to use the imagefolder data loader; combined with with_transform(transform_to_tensor), you get a good tradeoff between performance and disk space used
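Putting that together, a sketch of what the combination could look like (data_dir and transform_to_tensor are placeholders for your own path and transform):

from datasets import load_dataset

ds = load_dataset("imagefolder", data_dir="./my_images", split="train")

# images stay on disk; the transform runs on the fly whenever rows are accessed
# note: ImageFolder exposes the decoded images under the "image" column
ds = ds.with_transform(transform_to_tensor)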

the training loop is terribly slow (time spent on disk + CPU data loading)

To improve data loading speed, feel free to use a PyTorch DataLoader with num_workers > 1; this way, image decoding can be done in parallel in subprocesses.
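A minimal sketch of that (ds_1 is the method 1 dataset from above; the batch size and worker count are arbitrary):

from torch.utils.data import DataLoader

# each worker decodes and transforms its own batches in a subprocess
loader = DataLoader(ds_1, batch_size=32, num_workers=4, shuffle=True)

for batch in loader:
    pixel_values = batch["pixel_values"]  # already a torch tensor thanks to with_transform
    ...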

It took me way too long to load 30 GB of images with ImageFolder, so I used this approach instead, which brought the loading time down from 70 seconds to 0.03 seconds.