Stream image dataset from (Azure) cloud storage

sonnehansen · August 4, 2023, 7:30am

Hi,
I am trying to turn our image data into an HF-formatted dataset, with a custom data loader script, stored on our Azure blob storage. I would then like to stream it (or use fsspec / adlfs) so that I can consume the dataset without downloading all of it.

I have looked around and found:

The official datasets docs on cloud storage. I can make it work with load_from_disk, but that is not what I want.
A forum thread that discusses load_dataset('s3:...', streaming=True) and a related GitHub issue that was fixed by a PR.

(I apologize for the missing links to the official docs and the PR, but being new to the forum I can only include 2 links in a post.)

However, I still cannot make it work with image files and a custom data loader script without first downloading and/or zipping all the files. There is no documentation for the feature, and most examples I find use the supported formats (‘json’, …) rather than image files. (I did manage to stream using the Azure blob storage REST API with the images in a zip file.)

The reason I would prefer not to zip is that we consume images from the storage directly. I would prefer to use the same location for the dataset, rather than duplicating the data to another location. I am very open to reasons why I am wrong in seeing this as an undesirable pattern.

Any thoughts on whether what I want to achieve is possible and, if so, how to go about it? Alternatively, are there other best practices that I should pursue instead?
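
A minimal sketch of the access pattern described above: list the image files under a storage prefix and read each one lazily, without downloading or zipping everything first. This is not a confirmed datasets feature. The function name iter_image_records, the extension list and the record layout are hypothetical, chosen only for illustration, and fs is assumed to be any fsspec-style filesystem object that exposes find() and open().

```python
from typing import Any, Dict, Iterator, Sequence

# Hypothetical default; adjust to the image formats actually stored.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def iter_image_records(
    fs: Any,
    root: str,
    extensions: Sequence[str] = IMAGE_EXTENSIONS,
) -> Iterator[Dict[str, Any]]:
    """Lazily yield one record per image file found under `root`.

    `fs` is assumed to be an fsspec-style filesystem: `find(path)` returns
    file paths below `path`, and `open(path, "rb")` returns a binary file
    object usable as a context manager.
    """
    suffixes = tuple(ext.lower() for ext in extensions)
    for path in sorted(fs.find(root)):
        # Skip anything that does not look like an image file.
        if not path.lower().endswith(suffixes):
            continue
        # Only the current file is read; nothing is copied to local disk.
        with fs.open(path, "rb") as handle:
            yield {"path": path, "bytes": handle.read()}
```

With adlfs installed, fs could presumably be created with fsspec.filesystem("abfs", account_name=..., account_key=...), and the generator might then be wrapped with datasets.IterableDataset.from_generator(iter_image_records, gen_kwargs={"fs": fs, "root": ...}). Both steps are assumptions that have not been tested against Azure, and neither involves a custom data loader script.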

Thanks in advance,
Mikael

vazgbruno · December 27, 2023, 5:12pm

I think I have a similar problem, but with audio files. Were you able to find a solution?

sonnehansen · January 8, 2024, 2:33pm

Hi @vazgbruno,

No, I was told it is not supported yet, and I have abandoned HF for the time being. It seems that the discussion continues on the GitHub issue.

Cheers
Mikael

vazgbruno · January 8, 2024, 2:45pm

Thanks for answering and for sharing the GitHub issue @sonnehansen!

All the best,
Bruno
