Hi,
I am trying to turn our image data, which is stored on our Azure Blob Storage, into an HF-formatted dataset with a custom data loader script. I would then like to stream it (or use fsspec / adlfs) so that the dataset can be consumed without downloading it in full.
I have looked around and found:
- The official datasets doc on cloud storage, which I can make work with `load_from_disk`, but that is not what I want.
- A forum thread that discusses `load_dataset('s3:...', streaming=True)`, and a related GitHub issue that is fixed by a PR.
(I apologize for the missing links to the official docs and the PR, but being new to the forum I can only include 2 links in a post.)
However, I still cannot make it work for my case: image files, a custom data loader script, and no intermediate step of downloading and/or zipping all files. There is no documentation around the feature, and most examples I find use the supported formats (‘json’, …) rather than image files. I did manage to stream using the Azure Blob Storage REST API with the images in a zip file.
I would prefer not to zip the files because we consume the images directly from the storage. I would rather use that same location for the dataset than duplicate the data to another location. I am very open to hearing why I am wrong to see this duplication as an undesirable pattern.
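To make the goal concrete, here is a minimal sketch of the access pattern I have in mind: list the image paths once, then read each file only when it is iterated over. The function name `iter_image_examples`, the `open_fn` parameter, and the extension list are hypothetical names made up for illustration, and the adlfs wiring in the trailing comment is an untested assumption with placeholder account and container names.

```python
from typing import BinaryIO, Callable, Iterable, Iterator

# Assumed set of image extensions; adjust to whatever the storage holds.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def iter_image_examples(
    paths: Iterable[str],
    open_fn: Callable[[str], BinaryIO],
) -> Iterator[dict]:
    """Lazily yield one example per image file.

    Nothing is opened or read until the generator is iterated, so only
    the files actually consumed are fetched from storage.
    """
    for path in paths:
        # Skip anything that does not look like an image.
        if not path.lower().endswith(IMAGE_EXTENSIONS):
            continue
        with open_fn(path) as f:
            yield {"path": path, "image_bytes": f.read()}


# Hypothetical wiring with fsspec + adlfs (not run here; placeholders only):
#   import fsspec
#   fs = fsspec.filesystem("abfs", account_name="<account>", account_key="<key>")
#   paths = fs.find("<container>/images")      # recursive file listing
#   examples = iter_image_examples(paths, fs.open)  # fs.open defaults to "rb"
```

If something like this is a reasonable direction, I assume the generator could then be wrapped with `IterableDataset.from_generator` from the datasets library, but I have not confirmed that this works well against blob storage.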
Any thoughts on whether what I want to achieve is possible and, if so, how to go about it? Alternatively, are there other best practices that I should pursue instead?
Thanks in advance, Mikael