Stream image dataset from (Azure) cloud storage

Hi,
I am trying to turn our image data, which is stored on our Azure blob storage, into an HF-formatted dataset with a custom data loader script. I would then like to stream it (or use fsspec / adlfs) so that I can consume the dataset without downloading it in full.

I have looked around and found:

  • the official datasets doc on cloud storage, which I can make work with load_from_disk, but that is not what I want
  • a forum thread that discusses load_dataset('s3:...', streaming=True) and a related GitHub issue that was fixed by a PR

(I apologize for the missing links to official docs and PR but being new to the forum I can only include 2 links in a post :cry: )

However, I still cannot make it work with image files and a custom data loader script without first downloading and/or zipping all the files. There is no documentation for the feature, and most examples I find use the supported formats (‘json’, …) rather than image files. (I did manage to stream using the Azure Blob Storage REST API with the images in a zip file.)

The reason I would prefer not to zip the files is that we consume the images directly from the storage. I would rather use the same location for the dataset than duplicate the data to another location. I am very open to hearing why I am wrong to see this as an undesirable pattern.
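
To make the goal concrete, here is a minimal sketch of the access pattern described above: listing image blobs in place and reading each one lazily, with no zip step and no full download. It is not a datasets feature and not a confirmed solution. The helper names `iter_image_examples` and `make_azure_filesystem`, the extension list, and the example dict keys are all hypothetical, and the Azure part assumes adlfs is installed so that fsspec knows the "abfs" protocol.

```python
from typing import Any, Dict, Iterator

# Hypothetical choice of extensions; adjust to the files actually stored.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def iter_image_examples(fs: Any, root: str) -> Iterator[Dict[str, Any]]:
    """Lazily yield one example per image found under `root`.

    `fs` is any fsspec-style filesystem object that provides
    `find(path)` (list of file paths) and `open(path, "rb")`.
    Only the file currently being yielded is read into memory.
    """
    for path in sorted(fs.find(root)):
        if not path.lower().endswith(IMAGE_EXTENSIONS):
            continue  # skip non-image blobs
        with fs.open(path, "rb") as handle:
            yield {"path": path, "image_bytes": handle.read()}


def make_azure_filesystem(account_name: str, account_key: str) -> Any:
    """Build a filesystem for Azure Blob Storage.

    Assumption: the adlfs package is installed; it registers the
    "abfs" protocol with fsspec. Imported lazily so the generator
    above can be used with any other fsspec-style filesystem.
    """
    import fsspec

    return fsspec.filesystem(
        "abfs", account_name=account_name, account_key=account_key
    )
```

With a real account this would be used as `fs = make_azure_filesystem(...)` followed by iterating `iter_image_examples(fs, "<container>/<prefix>")`. Whether such a generator can be plugged into a streaming HF dataset (for example via `IterableDataset.from_generator` in newer datasets releases) is an assumption that would need checking against the installed version.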

Any thoughts on whether what I want to achieve is possible and, if so, how to go about it? Alternatively, are there other best practices that I should pursue instead?

Thanks in advance :slight_smile: , Mikael

I think I have a similar problem, but with audio files. Were you able to find a solution?

Hi @vazgbruno,

No, I was told it was not supported yet, so I have abandoned HF for the time being. It seems that the discussion continues on the GitHub issue.

Cheers
Mikael

Thanks for answering and for sharing the GitHub issue @sonnehansen!

All the best,
Bruno