Image dataset best practices?

Hi! I’m one of the founders of Segments.ai, a data labeling platform for computer vision. We’re working on an integration with HuggingFace, making it possible to export labeled datasets to the 🤗 hub.

From reading the docs and toying around a bit, I found there are a few potential ways to set up an image dataset:

  1. Keep the image files out of the repository, and download them from their URLs (they’re hosted in the cloud) in the dataset loading script. The disadvantage here is that if the image URLs ever become unavailable, the dataset also won’t work anymore.
  2. Store the image files in the repository, packed together in a few large parquet files, using git-lfs. This is basically what happens when you create a dataset with an image column locally, and run dataset.push_to_hub(). See this dataset.
  3. Store the image files in the repository as individual jpg/png files, using git-lfs. Compared to the previous approach, this requires a custom dataset loading script. This seems cleaner from a versioning point of view: when images are added or removed later on, it leads to a compact diff compared to working with the parquet files. But perhaps it’s not ideal to have so many small files in a git-lfs repo.

Do you have any recommendations on what would be the cleanest approach that is considered best practice?

6 Likes

Hi Bert, thanks for reaching out, and good job with segments.ai!

You mentioned three different ways of hosting an image dataset, and they’re all sensible ideas. Here are a few aspects that can help you decide which one is best for your case:

  1. Storing the URLs. It has several disadvantages: it’s less convenient, less reproducible, and probably won’t work in the long run. This should be avoided as much as possible IMO. However, for certain datasets with copyright/licensing issues this can still be a solution, in particular if you’re not allowed to redistribute the images yourself.

  2. Use Parquet files (e.g. from push_to_hub). It’s nice in several respects:

    a. You can store much more than just the images: Parquet is a columnar format, so you can have one column for the image data, one for the labels, and additional columns for metadata, for example.

    b. It has compression that makes it suitable for long-term storage of big datasets.

    c. It’s a standard format for columnar data processing (think pandas, dask, spark).

    d. It is compatible with efficient data streaming: you can stream your image dataset during training, for example (see the sketch at the end of this post).

    e. It makes dataset sharing easy, since everything is packaged together (images + labels + metadata).

    f. Parquet files are suitable for sharding: if your dataset is too big (hundreds of GB or terabytes), you can simply split it into parquet files of reasonable size (around 200-500MB per file).

    g. You can append new entries simply by adding new parquet files.

    h. You have easy random access to your images.

    i. It works very well with Arrow (the back-end storage of HF Datasets).

    However, as you said, updating the dataset requires regenerating the parquet files.

  3. Store the raw images. It is very flexible since you can add/update/remove images pretty easily. It can be convenient, especially for small datasets. However:

    a. You’ll start to have trouble using such a format for big datasets (hundreds of thousands of images). It may require extra effort to structure your directories to find your images easily and align them with your labels or metadata.

    b. You need to use some standard structures to let systems load your images and your labels automatically. Those structures are often task-specific, and need to be popular enough to be supported in your favorite libraries. Alternatively, you might need to implement the data loading yourself (see the sketch right after this list).

    c. It’s also extremely inefficient for data streaming, since you have to fetch the images one by one.

    d. To share such datasets you have to zip/tar them.
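
As a minimal sketch of option 3 (assuming a recent version of datasets that includes the imagefolder loader; the directory layout and paths below are made up for illustration), loading individual image files can look like this:

from datasets import load_dataset

# Hypothetical layout: data/train/cat/001.jpg, data/train/dog/002.jpg, ...
# The "imagefolder" builder infers the label from each image's parent directory.
dataset = load_dataset("imagefolder", data_dir="data/train", split="train")

print(dataset[0]["image"])  # decoded lazily as a PIL image
print(dataset[0]["label"])  # class index inferred from the folder name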

To conclude on this:

Option 2 is the go-to solution if your dataset is big or frozen, or if you need fancy stuff like parallel processing or streaming.

Option 3 is preferable only for a small dataset, while you are constructing it, or when you need flexibility.

Option 1 should be avoided, unless you have no other option.
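
And here is the streaming sketch mentioned above: a parquet-backed dataset on the Hub can be iterated lazily during training. This assumes a recent version of datasets; the repo id and column names are placeholders.

from datasets import load_dataset

# Placeholder repo id; any dataset created with push_to_hub() works the same way,
# since it is stored as parquet shards on the Hub.
streamed = load_dataset("username/my-image-dataset", split="train", streaming=True)

# Nothing is downloaded up front; samples are fetched and decoded on the fly.
for example in streamed.take(8):
    image = example["image"]   # placeholder column names
    label = example["label"]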

cc @osanseviero

7 Likes

Interesting!

Maybe it would be a good idea to convert the HF example datasets to Parquet files. I was trying to set up my own HF dataset and simply copied the approach of loading the images from URLs from the mnist dataset.

1 Like

Thanks Quentin for your extensive answer, very informative and extremely helpful!

Sounds like option 2 is the way to go for us, and is also the easiest to get started thanks to the push_to_hub functionality.

Expect some more dataset-related questions from us in the coming days ;).

1 Like

FYI, the endpoints are public and are documented here. That means that if you want to manage it from other programming languages, that should be feasible as well.

There is also huggingface_hub (https://github.com/huggingface/huggingface_hub), a Python library to interact with these endpoints, but if you’re already using push_to_hub, that’s the way to go.
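
For illustration, with a recent version of huggingface_hub, uploading a file to a dataset repo looks roughly like this (the repo id and file names are placeholders):

from huggingface_hub import HfApi

api = HfApi()
# Placeholder repo id and file name, just to show the shape of the call.
api.upload_file(
    path_or_fileobj="train-00000-of-00001.parquet",
    path_in_repo="data/train-00000-of-00001.parquet",
    repo_id="username/my-image-dataset",
    repo_type="dataset",
)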

1 Like

Came across a nice blog post regarding uploading an image dataset to the hub using push_to_hub (option 2 in this thread).

1 Like

Hi @segments-bert @segments-tobias! We’ve just merged a PR that adds support for the Image feature in push_to_hub. You can test it by installing datasets from master:

pip install git+https://github.com/huggingface/datasets

Any feedback is greatly appreciated!

3 Likes

Hey @mariosasko, this is great! Before, I used the dataset.map() approach described in the blog post mentioned by @nielsr. Now I can skip that and just push to the hub like this:

import datasets

# The Image() feature takes care of reading and embedding the file bytes on push.
features = datasets.Features({
    'name': datasets.Value('string'),
    'image': datasets.Image(),
})

# The image column can simply contain local file paths.
dataset_dict = {
    'name': ['cat.jpg'],
    'image': ['cat.jpg'],
}

dataset = datasets.Dataset.from_dict(dataset_dict, features=features)
dataset.push_to_hub('segments-bert/image-upload-test')

A few additional questions:

  • There is no difference between the dataset.map() approach and this approach, in how the images end up in the parquet files, right?
  • Do I understand correctly that the image bytes are stored in decoded form in the parquet files, i.e. not in their original .jpg or .png format? I assume so, given this comment by @lhoestq.
  • Would it make sense to add a generic “File” feature, besides the existing Image and Audio features? I’m asking because our data labeling platform also supports video and point cloud labeling, for which the data comes in a variety of file formats (.pcd, .bin, .mp4, …). Would be nice to push such datasets to the hub as well.
1 Like

Hi @segments-bert! Thanks, glad you like it.

  • There is no difference between the dataset.map() approach and this approach, in how the images end up in the parquet files, right?

Yes, there is no difference.

  • Do I understand correctly that the image bytes are stored in decoded form in the parquet files, i.e. not in their original .jpg or .png format? I assume so, given this comment by @lhoestq.

The image bytes stored in the parquet files are raw/unprocessed, i.e. they keep their original .jpg or .png encoding rather than being stored as decoded arrays.

  • Would it make sense to add a generic “File” feature, besides the existing Image and Audio features? I’m asking because our data labeling platform also supports video and point cloud labeling, for which the data comes in a variety of file formats (.pcd, .bin, .mp4, …). Would be nice to push such datasets to the hub as well.

An interesting proposal. We will consider it. For now, you can embed file bytes (Value("binary")) in a dataset before push_to_hub and apply some additional preprocessing after loading.
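
A minimal sketch of that workaround (file names and the repo id are placeholders), reading raw file bytes into a Value('binary') column before pushing:

import datasets

features = datasets.Features({
    'name': datasets.Value('string'),
    'data': datasets.Value('binary'),  # raw file bytes, no decoding
})

# Placeholder file; any .pcd/.bin/.mp4 file would work the same way.
with open('scan.pcd', 'rb') as f:
    raw_bytes = f.read()

dataset = datasets.Dataset.from_dict(
    {'name': ['scan.pcd'], 'data': [raw_bytes]},
    features=features,
)
# dataset.push_to_hub('username/pointcloud-upload-test')

# After load_dataset(), the 'data' column comes back as bytes and can be
# decoded with whatever library understands the format.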

1 Like

In choosing between options 2 and 3:

With both of these options the images themselves are stored in the repository, so there is no risk of data loss. That risk would be a major reason for having to remove a point from the dataset later on; since it is gone, it is very likely the dataset will be append-only for the life of the project. For an append-only dataset there is not much value in using individual files, with their more readable diffs and delta compression.
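
For completeness, a minimal sketch of an append-only update under option 2 (the repo id and file names are placeholders, and this assumes the dataset was originally created with push_to_hub):

from datasets import Dataset, concatenate_datasets, load_dataset

# Placeholder repo id.
existing = load_dataset('username/my-image-dataset', split='train')

# New rows reuse the existing features, so the image column can again be file paths.
new_rows = Dataset.from_dict(
    {'name': ['new_cat.jpg'], 'image': ['new_cat.jpg']},
    features=existing.features,
)

updated = concatenate_datasets([existing, new_rows])
updated.push_to_hub('username/my-image-dataset')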