Image dataset best practices?

Hi Bert, thanks for reaching out, and good job with segments.ai!

You mentioned three different ways of hosting an image dataset, and they’re all sensible ideas. Here are a few aspects that can help you decide which one is best for your case:

  1. Storing the URLs. This has several disadvantages: it’s less convenient, it hurts reproducibility, and it probably doesn’t work in the long run since links go dead. It should be avoided as much as possible IMO. However, for certain datasets with copyright/licensing issues this can still be a solution, in particular if you’re not allowed to redistribute the images yourself.
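For reference, here is a minimal sketch of what loading from stored URLs looks like at access time; the URL, function name, and timeout are hypothetical, and every access depends on the remote file still existing:

```python
import io

import requests
from PIL import Image

def fetch_image(url: str) -> Image.Image:
    # Download and decode a single image from its URL. This raises
    # whenever the link has gone dead, which is exactly the long-term
    # fragility of this approach.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return Image.open(io.BytesIO(resp.content))

img = fetch_image("https://example.com/cat.jpg")  # hypothetical URL
```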

  2. Use Parquet files (e.g. from push_to_hub). It’s nice in several respects (see the sketch after this list):

    a. You can store much more than just the images: Parquet is a columnar format, so you can have one column for the image data, one for the labels, and even more columns for metadata, for example.

    b. It has compression that makes it suitable for long-term storage of big datasets.

    c. It’s a standard format for columnar data processing (think pandas, dask, spark).

    d. It’s compatible with efficient data streaming: you can stream your image dataset during training, for example.

    e. It makes dataset sharing easy, since everything is packaged together (images + labels + metadata).

    f. Parquet files are suitable for sharding: if your dataset is too big (hundreds of GB or terabytes), you can just split it into Parquet files of reasonable size (like 200-500MB per file).

    g. You can append new entries simply by adding new Parquet files.

    h. You can easily get random access to your images.

    i. It works very well with Arrow (the back-end storage of HF Datasets).

    However, as you said, updating the dataset requires regenerating the Parquet files.
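To make this concrete, here is a minimal sketch of building an image dataset with HF Datasets and pushing it to the Hub as sharded Parquet files; the file paths and repo id are made up for the example:

```python
from datasets import Dataset, Image, load_dataset

# Build a dataset from image file paths plus label and metadata columns.
ds = Dataset.from_dict({
    "image": ["images/0001.jpg", "images/0002.jpg"],  # hypothetical paths
    "label": [0, 1],
    "source": ["camera_a", "camera_b"],
})
# Casting the path column to Image() turns it into an image column
# (the image data gets embedded when the dataset is saved or pushed).
ds = ds.cast_column("image", Image())

# push_to_hub uploads the dataset as Parquet files; max_shard_size keeps
# each shard at a reasonable size for big datasets.
ds.push_to_hub("bert/pets-demo", max_shard_size="500MB")  # hypothetical repo

# Random access works out of the box:
example = ds[0]  # {"image": <PIL.Image>, "label": 0, "source": "camera_a"}

# And the Parquet-backed dataset can later be streamed during training,
# without downloading everything first:
streamed = load_dataset("bert/pets-demo", split="train", streaming=True)
```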

  3. Store the raw images. It is very flexible since you can add/update/remove images pretty easily. It can be convenient, especially for small datasets. However:

    a. You’ll start to have trouble using such a format for big datasets (hundreds of thousands of images). It may require extra effort to structure your directories to find your images easily and align them with your labels or metadata.

    b. You need to use some standard structures to let systems load your images and your labels automatically. Those structures are often task-specific, and need to be popular enough to be supported in your favorite libraries. Alternatively you might need to implement the data loading yourself (see the sketch after this list).

    c. It’s also extremely inefficient for data streaming, since you have to fetch the images one by one.

    d. To share such datasets, you have to zip/tar them.
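As an example of such a standard structure, the image-classification layout (one subdirectory per class) can be loaded automatically with HF Datasets; the directory and file names below are hypothetical:

```python
from datasets import load_dataset

# Expected layout (hypothetical), one subdirectory per class label:
#   pets/train/cat/0001.jpg
#   pets/train/dog/0002.jpg
ds = load_dataset("imagefolder", data_dir="pets")
print(ds["train"][0])  # {"image": <PIL.Image ...>, "label": 0}
```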

To conclude on this:

Option 2 is the go-to solution if your dataset is big or frozen, or if you need fancy stuff like parallel processing or streaming.

Option 3 is preferable only for a small dataset, when you’re still constructing it, or when you need flexibility.

Option 1 should be avoided, unless you have no other option.

cc @osanseviero
