How to convert dir-with-images properly?

Hi,

I have an issue with using datasets properly, and if you have a better solution than my rather dumb one, please share it.
Context:
The task I am solving is panoptic segmentation.
I have a directory with the following structure:

images/*.jpg
segmentations/*.png
instance_ids/*.png

I create the dataset as a DatasetDict and then push_to_hub(). The data gets magically converted to Arrow+Parquet and uploaded to the HF Hub.
Everything is nice until the dataset grows to several million images: the upload then crashes at a random moment, and it is also a very slow process.
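For context, the dataset is assembled roughly like this (a sketch; the column names and the repo id are placeholders, not my exact code):

import glob
from datasets import Dataset, DatasetDict, Image

# Build one split from the three directories; casting the path columns to
# Image() is what makes push_to_hub() read and embed the actual files.
ds = Dataset.from_dict({
    "image": sorted(glob.glob("images/*.jpg")),
    "segmentation": sorted(glob.glob("segmentations/*.png")),
    "instance_ids": sorted(glob.glob("instance_ids/*.png")),
})
for col in ("image", "segmentation", "instance_ids"):
    ds = ds.cast_column(col, Image())

dataset = DatasetDict({"train": ds})
dataset.push_to_hub("my-username/my-dataset")  # slow and crash-prone at this scale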

So, I was advised to use the HF CLI tool, specifically this PR.
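For reference, the rough Python equivalent of that upload step is huggingface_hub's upload_folder (a sketch; the repo id and local path are placeholders):

from huggingface_hub import HfApi

# Upload already-prepared Parquet files directly, without going through push_to_hub().
api = HfApi()
api.upload_folder(
    folder_path="out/parquet",
    repo_id="my-username/my-dataset",
    repo_type="dataset",
)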

The CLI works amazingly well: fast and reliable.
However, now I have to convert the dataset to Arrow+Parquet myself.
Moreover, I cannot just shard the dataset, because then the images themselves are not written into the Parquet files, only their filenames.
So the solution below does not work:

for k, v in dataset.items():
    print(f"Saving {k}")
    shard_size = 5000
    num_shards = len(v) // shard_size + 1
    for shard_idx in range(num_shards):
        shard = v.shard(index=shard_idx, num_shards=num_shards)
        shard.to_parquet(f"out/parquet/data/{k}-{shard_idx:05d}-of-{num_shards:05d}.parquet")

What I have to do instead is first save the dataset to disk in Arrow format, then load it back, and only then write the Parquet files.

import datasets

dataset.save_to_disk("arrowdir", num_proc=48)
dataset = datasets.load_from_disk("arrowdir")
for k, v in dataset.items():
    print(f"Saving {k}")
    shard_size = 5000
    # Set the number of output files (e.g. try to keep each file smaller than 5 GB).
    num_shards = len(v) // shard_size + 1
    for shard_idx in range(num_shards):
        shard = v.shard(index=shard_idx, num_shards=num_shards)
        shard.to_parquet(f"out/parquet/data/{k}-{shard_idx:05d}-of-{num_shards:05d}.parquet")

This way I need disk space of 3x the size of the dataset: for the original images, for the Arrow copy, and for the Parquet files.
I also don't like that the sharding is done manually, which is error-prone.

Another way would probably be to use WebDataset, and to do this I would have to move my original files from

images/*.jpg
segmentations/*.png
instance_ids/*.png

structure into

dir/*.image.jpg
dir/*.segmentation.png
dir/*.instance.png

I don't like that either, because this way everything is dumped into one big directory.
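For completeness, if I went the WebDataset route, the conversion would look roughly like this (a sketch using webdataset.ShardWriter; it assumes matching basenames across the three directories):

import glob
import os
import webdataset as wds

# Pack each sample (image + segmentation + instance map) into tar shards of 5000 samples.
os.makedirs("out/webdataset", exist_ok=True)
with wds.ShardWriter("out/webdataset/%06d.tar", maxcount=5000) as writer:
    for image_path in sorted(glob.glob("images/*.jpg")):
        key = os.path.splitext(os.path.basename(image_path))[0]
        writer.write({
            "__key__": key,
            "image.jpg": open(image_path, "rb").read(),
            "segmentation.png": open(f"segmentations/{key}.png", "rb").read(),
            "instance.png": open(f"instance_ids/{key}.png", "rb").read(),
        })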

Is there a better way of solving the problem?

Best, Dmytro.

Hi ! Indeed it seems that to_parquet() doesn't embed the images inside the Parquet data at the moment, which we should fix.

It can be fixed by using the same logic as push_to_hub(), i.e. applying a map() function that reads the images to embed them in the Arrow data. You can find the code that does this step here: datasets/src/datasets/arrow_dataset.py at 5fdad6da7fb0b306f63a39d0b03586b458e5ca07 · huggingface/datasets · GitHub
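Roughly, a sketch of what that looks like, adapted from the linked code (embed_table_storage is an internal helper, so the exact names and arguments may change between datasets versions):

from datasets.table import embed_table_storage

def embed_external_files(split):
    # Switch to Arrow format and rewrite the storage so the image bytes
    # (rather than the file paths) are stored in the table, the same way
    # push_to_hub() does before writing Parquet.
    fmt = split.format
    split = split.with_format("arrow")
    split = split.map(embed_table_storage, batched=True, batch_size=1000)
    return split.with_format(**fmt)

# `dataset` is the DatasetDict from above; no save_to_disk() roundtrip needed.
for k, v in dataset.items():
    num_shards = len(v) // 5000 + 1
    for shard_idx in range(num_shards):
        shard = embed_external_files(v.shard(index=shard_idx, num_shards=num_shards))
        shard.to_parquet(f"out/parquet/data/{k}-{shard_idx:05d}-of-{num_shards:05d}.parquet")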


Thank you, looks much cleaner than what I have now!