How to convert dir-with-images properly?

Hi,

I have an issue with using datasets properly, and if you have a better solution than my rather dumb one, please share it.
Context:
The task I am solving is panoptic segmentation.
I have a directory with the following structure:

images/*.jpg
segmentations/*.png
instance_ids/*.png

I create the dataset as a DatasetDict and then push_to_hub(). The data gets magically converted to Arrow+Parquet and uploaded to the HF Hub.
Everything is nice until the dataset grows to several million images: the upload then crashes at a random moment, and it is also a very slow process.
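For context, the dataset is assembled roughly like this (a sketch; the column names and the repo id are placeholders, not my exact code):

import glob
from datasets import Dataset, DatasetDict, Image

# Build one split from the three directories; casting the path columns to
# Image() is what makes push_to_hub() read and embed the actual files.
ds = Dataset.from_dict({
    "image": sorted(glob.glob("images/*.jpg")),
    "segmentation": sorted(glob.glob("segmentations/*.png")),
    "instance_ids": sorted(glob.glob("instance_ids/*.png")),
})
for col in ("image", "segmentation", "instance_ids"):
    ds = ds.cast_column(col, Image())

dataset = DatasetDict({"train": ds})
dataset.push_to_hub("my-username/my-dataset")  # slow and crash-prone at this scale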

So, I was advised to use the HF CLI tool, specifically this PR.
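For reference, the rough Python equivalent of that upload step is huggingface_hub's upload_folder (a sketch; the repo id and local path are placeholders):

from huggingface_hub import HfApi

# Upload already-prepared Parquet files directly, without going through push_to_hub().
api = HfApi()
api.upload_folder(
    folder_path="out/parquet",
    repo_id="my-username/my-dataset",
    repo_type="dataset",
)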

The CLI works amazingly well: fast and reliable.
However, now I have to convert the dataset to Arrow+Parquet myself.
Moreover, I cannot just shard the dataset, because then the images themselves are not written into the Parquet files, only their filenames.
So the solution below does not work:

for k, v in dataset.items():
    print(f"Saving {k}")
    shard_size = 5000
    num_shards = len(v) // shard_size + 1
    for shard_idx in range(num_shards):
        shard = v.shard(index=shard_idx, num_shards=num_shards)
        shard.to_parquet(f"out/parquet/data/{k}-{shard_idx:05d}-of-{num_shards:05d}.parquet")

What I have to do instead is first save the dataset to disk in Arrow format, then load it back, and only then write the Parquet files.

import datasets

dataset.save_to_disk("arrowdir", num_proc=48)
dataset = datasets.load_from_disk("arrowdir")
for k, v in dataset.items():
    print(f"Saving {k}")
    shard_size = 5000
    # Set the number of output files (e.g. try to keep each file smaller than 5 GB).
    num_shards = len(v) // shard_size + 1
    for shard_idx in range(num_shards):
        shard = v.shard(index=shard_idx, num_shards=num_shards)
        shard.to_parquet(f"out/parquet/data/{k}-{shard_idx:05d}-of-{num_shards:05d}.parquet")

This way I need disk space of 3x the size of the dataset: for the original images, for the Arrow copy, and for the Parquet files.
I also don't like that the sharding is done manually, which is error-prone.

Another way would probably be to use WebDataset, and to do this I would have to move my original files from

images/*.jpg
segmentations/*.png
instance_ids/*.png

structure into

dir/*.image.jpg
dir/*.segmentation.png
dir/*.instance.png

I don't like that either, because this way everything is dumped into one big directory.
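For completeness, if I went the WebDataset route, the conversion would look roughly like this (a sketch using webdataset.ShardWriter; it assumes matching basenames across the three directories):

import glob
import os
import webdataset as wds

# Pack each sample (image + segmentation + instance map) into tar shards of 5000 samples.
os.makedirs("out/webdataset", exist_ok=True)
with wds.ShardWriter("out/webdataset/%06d.tar", maxcount=5000) as writer:
    for image_path in sorted(glob.glob("images/*.jpg")):
        key = os.path.splitext(os.path.basename(image_path))[0]
        writer.write({
            "__key__": key,
            "image.jpg": open(image_path, "rb").read(),
            "segmentation.png": open(f"segmentations/{key}.png", "rb").read(),
            "instance.png": open(f"instance_ids/{key}.png", "rb").read(),
        })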

Is there a better way of solving the problem?

Best, Dmytro.

Hi ! Indeed it seems that to_parquet() doesn't embed the images inside the Parquet data at the moment, which we should fix.

It can be fixed by using the same logic as push_to_hub(), i.e. applying a map() function that reads the images to embed them in the Arrow data. You can find the code that does this step here: datasets/src/datasets/arrow_dataset.py at 5fdad6da7fb0b306f63a39d0b03586b458e5ca07 · huggingface/datasets · GitHub
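Roughly, a sketch of what that looks like, adapted from the linked code (embed_table_storage is an internal helper, so the exact names and arguments may change between datasets versions):

from datasets.table import embed_table_storage

def embed_external_files(split):
    # Switch to Arrow format and rewrite the storage so the image bytes
    # (rather than the file paths) are stored in the table, the same way
    # push_to_hub() does before writing Parquet.
    fmt = split.format
    split = split.with_format("arrow")
    split = split.map(embed_table_storage, batched=True, batch_size=1000)
    return split.with_format(**fmt)

# `dataset` is the DatasetDict from above; no save_to_disk() roundtrip needed.
for k, v in dataset.items():
    num_shards = len(v) // 5000 + 1
    for shard_idx in range(num_shards):
        shard = embed_external_files(v.shard(index=shard_idx, num_shards=num_shards))
        shard.to_parquet(f"out/parquet/data/{k}-{shard_idx:05d}-of-{num_shards:05d}.parquet")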


Thank you, looks much cleaner than what I have now!