Best practices for loading image files

Hi everybody,
I’m looking for best practices for loading image files.
Here is what I’ve tried so far:

from datasets import Dataset, Features, Array3D, Image

ds = Dataset.from_dict({"path": [my_jpg]})

# Method 1: cast + transform
ds_1 = (
    ds.rename_column("path", "pixel_values")
    .cast_column("pixel_values", Image())
    .with_transform(transform_to_tensor)
)

# Method 2: map + map
features = Features({"pixel_values": Array3D(..., dtype="float64")})
ds_2 = (
    ds.map(read_images, batched=True, batch_size=64)
    .map(transform_to_numpy, features=features, batched=True, batch_size=64, num_proc=8)
    .with_format("pt")
)

# Method 3: map + transform
ds_3 = (
    ds.map(read_images, batched=True, batch_size=64)
    .with_transform(transform_to_tensor)
)

# Method 4: ONE MAP
ds_4 = (
    ds.map(read_images_and_process_to_numpy, batched=True, batch_size=64)
    .with_format("pt")
)
  • Method 1 is really nice, but since everything is lazily loaded, the training loop is terribly slow (time spent on disk + CPU data loading)
  • Methods 2 and 3: the read_images map takes about 1 hour for 8 GB of data
  • Method 4: kind of works, but maybe not the most elegant? All data augmentation is done during loading.

Hi,

Probably the easiest way to load images into a Dataset is by leveraging the new ImageFolder builder.
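For reference, a minimal sketch of that (the data_dir path is a hypothetical example):

from datasets import load_dataset

# "./my_images" is a placeholder folder; ImageFolder infers labels from subfolder names
ds = load_dataset("imagefolder", data_dir="./my_images", split="train")
ds[0]["image"]  # decoded into a PIL image lazily, when the row is accessed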

method 1 saves you some disk space, since it reads the image from the path you provided without copying it to your disk

method 2 decodes your images completely and saves them on disk. The Array3D type makes it very fast to read the images back, with zero-copy reads of the arrays from disk. So you get high throughput at the cost of disk space.

method 3 doesn’t give better performance than 1, yet it requires some processing, and the map call copies the images into a new dataset, which consumes a bit of disk space

method 4 is the same as 2, but since it doesn’t have any intermediate step, you save some disk space. Note that you didn’t use Array3D for this one, but you probably should, to get the best performance :wink:
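A rough sketch of method 4 with an Array3D feature (the shape and dtype below are assumptions, and read_images_and_process_to_numpy is your own helper):

from datasets import Features, Array3D

# hypothetical fixed shape (channels, height, width); adjust to your preprocessing
features = Features({"pixel_values": Array3D(shape=(3, 224, 224), dtype="float32")})

ds_4 = (
    ds.map(
        read_images_and_process_to_numpy,
        batched=True,
        batch_size=64,
        features=features,          # store decoded arrays with a fixed Array3D type
        remove_columns=["path"],    # drop the path column so the schema matches `features`
    )
    .with_format("pt")
)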

As mentioned above, feel free to use the imagefolder data loader; combined with with_transform(transform_to_tensor), you get a good tradeoff between performance and disk space used
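Putting that together, a sketch of what the combination could look like (data_dir and transform_to_tensor are placeholders for your own path and transform):

from datasets import load_dataset

ds = load_dataset("imagefolder", data_dir="./my_images", split="train")

# images stay on disk; the transform runs on the fly whenever rows are accessed
# note: ImageFolder exposes the decoded images under the "image" column
ds = ds.with_transform(transform_to_tensor)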

the training loop is terribly slow (time spent on disk + CPU data loading)

To improve data loading speed, feel free to use a PyTorch DataLoader with num_workers > 1; this way, image decoding can be done in parallel in subprocesses.
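A minimal sketch of that (ds_1 is the method 1 dataset from above; the batch size and worker count are arbitrary):

from torch.utils.data import DataLoader

# each worker decodes and transforms its own batches in a subprocess
loader = DataLoader(ds_1, batch_size=32, num_workers=4, shuffle=True)

for batch in loader:
    pixel_values = batch["pixel_values"]  # already a torch tensor thanks to with_transform
    ...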

It took me way too long to load 30 GB of images with ImageFolder, so I used this approach instead, which brought the loading time down from 70 seconds to 0.03 seconds.