Download_custom method of StreamingDownloadManager not implemented

kevin-guillet · December 22, 2022, 4:01pm

Hello everyone,

I have a file dataset (CSV or JSONL) hosted on a private S3 bucket, and I wrote a loading script hosted in my HF Dataset repository (the script is straightforward, but I can show it to you if necessary).

My dataset is really heavy, so I want to use it in streaming mode:

dataset = load_dataset("my_dataset", streaming=True)

So my problem is:

A privately hosted dataset needs to be downloaded using DownloadManager.download_custom() with my URLs to the S3 bucket.
When using a dataset in streaming mode, the StreamingDownloadManager class doesn’t have a download_custom() method, only a download() method.

I don’t know if I am trying to do something naughty. In that case, what should I do instead?
or,
This is really a method we want to add to the class, but we can’t because of .

Thanks in advance for your answers

mariosasko · December 22, 2022, 7:38pm

Hi! We should probably implement StreamingDownloadManager.download_custom considering we’ve got several requests for it.

In the meantime, you can use this pattern in your script:

import s3fs

def _split_generators(self, dl_manager):
    ...
    if not dl_manager.is_streaming:
        data_path = dl_manager.download_custom(s3_url, custom_download_func)
    else:
        # convert s3_url to fsspec's format ("s3://path/to/bucket")
        # authentication explained here: https://s3fs.readthedocs.io/en/latest/#credentials
        data_path = convert_s3_url_to_fsspec_path(s3_url)
    ...

kevin-guillet · January 2, 2023, 3:55pm

Thank you for your answer. Yes, a possible solution is to use s3fs for fsspec file loading.
I will follow your developments in case you implement a download_custom() method for StreamingDownloadManager.

lhoestq · January 3, 2023, 10:38am

I guess an even better solution would be to support S3 directly in dl_manager.download ^^

sl02 · March 14, 2023, 6:27am

@mariosasko

Could you please share an example of how to convert the S3 URL to fsspec format? I looked at the document in the link, but it didn’t have any example for objects stored in a private S3 bucket.

lhoestq · March 24, 2023, 10:36am

The S3 URL should look like s3://bucket-name/path/to/data
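
For illustration, here is a minimal sketch of what a helper like convert_s3_url_to_fsspec_path from the pattern above might look like. The function body is an assumption (the thread never shows one), and it only handles the two common HTTPS layouts for S3 URLs (virtual-hosted and path-style); bucket and key names are made up.

```python
from urllib.parse import urlparse


def convert_s3_url_to_fsspec_path(s3_url: str) -> str:
    """Hypothetical helper: rewrite an HTTPS S3 URL as 's3://bucket-name/path/to/data'."""
    parsed = urlparse(s3_url)
    if parsed.scheme == "s3":
        # Already in the fsspec/s3fs format.
        return s3_url

    host = parsed.netloc
    path = parsed.path.lstrip("/")

    if host.startswith("s3.") or host.startswith("s3-"):
        # Path-style URL: https://s3.<region>.amazonaws.com/<bucket>/<key>
        bucket, _, key = path.partition("/")
    else:
        # Virtual-hosted URL: https://<bucket>.s3.<region>.amazonaws.com/<key>
        bucket, sep, _ = host.partition(".s3")
        if not sep:
            raise ValueError(f"Not a recognized S3 URL: {s3_url}")
        key = path

    return f"s3://{bucket}/{key}"
```

The private-bucket part is not encoded in the path itself: credentials are supplied separately, as described in the s3fs documentation linked earlier in the thread.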

ProgramComputer · August 20, 2023, 5:02am

I have noticed that download_custom is annotated as deprecated. How would I be able to stream a split zip file with just download and download_and_extract?

mariosasko · August 21, 2023, 5:29pm

Which tool did you use to create these “split zip files”? It’s best to use a (CLI) tool such as zip that splits an archive into valid zip chunks. Then, these chunks can be passed to dl_manager.download_and_extract.
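
To make "valid zip chunks" concrete: a byte-level splitter such as split cuts one archive into fragments that are not archives on their own, whereas each chunk here needs to be a complete zip file. The sketch below shows the idea with Python's standard zipfile module; the function name, the chunk file names, and the size threshold (measured on uncompressed input) are all assumptions for illustration, not something from the datasets library.

```python
import zipfile
from pathlib import Path


def write_zip_chunks(files, out_dir, max_bytes):
    """Pack files into several independent zip archives (hypothetical helper).

    A new archive is started once adding the next file would push the
    uncompressed input of the current archive past max_bytes.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    archive = None
    used = 0
    for file_path in map(Path, files):
        size = file_path.stat().st_size
        if archive is None or (used and used + size > max_bytes):
            if archive is not None:
                archive.close()
            chunk_path = out_dir / f"chunk_{len(chunks):04d}.zip"
            archive = zipfile.ZipFile(chunk_path, "w", zipfile.ZIP_DEFLATED)
            chunks.append(chunk_path)
            used = 0
        archive.write(file_path, arcname=file_path.name)
        used += size

    if archive is not None:
        archive.close()
    return chunks
```

Because every chunk opens as a standalone archive, the uploaded chunk URLs could then be passed as a list to dl_manager.download_and_extract, as suggested above.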

ProgramComputer · August 21, 2023, 5:48pm

I used the split tool. I have since used the zipsplit tool as you mentioned. Thanks for the suggestion.
