download_custom method of StreamingDownloadManager not implemented

Hello everyone,

I have a file dataset (CSV or JSONL) hosted on a private S3 bucket, and I wrote a loading script hosted in my HF Dataset repository (the script is straightforward, but I can share it if necessary).

My dataset is very large, so I want to use it in streaming mode:

dataset = load_dataset("my_dataset", streaming=True)

So my problem is:

  • A privately hosted dataset needs to be downloaded using DownloadManager.download_custom() with my URLs to the S3 bucket.
  • In streaming mode, the StreamingDownloadManager class doesn’t have a download_custom() method, only a download() method.

I don’t know if I’m trying to do something naughty. If so, what do I need to do instead?
Or is this a method we want to add to the class but can’t because of :eyes:?

Thanks in advance for your answers :smile:

Hi! We should probably implement StreamingDownloadManager.download_custom considering we’ve got several requests for it.

In the meantime, you can use this pattern in your script:

import s3fs

def _split_generators(self, dl_manager):
    ...
    if not dl_manager.is_streaming:
        data_path = dl_manager.download_custom(s3_url, custom_download_func)
    else:
        # convert s3_url to fsspec's format ("s3://path/to/bucket")
        # authentication explained here: https://s3fs.readthedocs.io/en/latest/#credentials
        data_path = convert_s3_url_to_fsspec_path(s3_url)
    ...

Thank you for your answer. Yes, using s3fs for fsspec file loading is a possible solution.
I will follow your developments in case you implement a download_custom() method for StreamingDownloadManager :crossed_fingers:

I guess an even better solution would be to support S3 directly in dl_manager.download ^^


@mariosasko

Could you please share an example of how to convert the S3 URL to fsspec format? I looked at the documentation in the link, but it didn’t have any example for objects stored in a private S3 bucket.

The S3 URL should look like s3://bucket-name/path/to/data
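
As an illustration, here is a minimal sketch of what the convert_s3_url_to_fsspec_path placeholder from the pattern above could look like. It is not part of datasets or s3fs. It assumes the original URL is a standard AWS HTTPS URL (virtual-hosted or path style) and rejects anything else. Credentials are not part of the path; s3fs picks them up as described in its documentation linked above.

```python
import re
from urllib.parse import unquote, urlparse

# Hypothetical helper: the name comes from the pattern above, the body is an assumption.
# Path style:           https://s3.<region>.amazonaws.com/<bucket>/<key>
_PATH_STYLE = re.compile(r"^s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$")
# Virtual-hosted style: https://<bucket>.s3.<region>.amazonaws.com/<key>
_VIRTUAL_STYLE = re.compile(r"^(?P<bucket>.+)\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$")


def convert_s3_url_to_fsspec_path(s3_url: str) -> str:
    """Return an fsspec-style path ("s3://bucket-name/path/to/data")."""
    parsed = urlparse(s3_url)
    if parsed.scheme == "s3":
        # Already in fsspec format, nothing to convert.
        return s3_url
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {s3_url!r}")

    host = (parsed.hostname or "").lower()
    key = unquote(parsed.path.lstrip("/"))

    if _PATH_STYLE.match(host):
        # The bucket is the first path segment.
        if not key:
            raise ValueError(f"No bucket found in URL: {s3_url!r}")
        return f"s3://{key}"

    match = _VIRTUAL_STYLE.match(host)
    if match:
        # The bucket is the host prefix in front of ".s3".
        return f"s3://{match.group('bucket')}/{key}"

    raise ValueError(f"Not a recognised AWS S3 URL: {s3_url!r}")
```

For example, convert_s3_url_to_fsspec_path("https://bucket-name.s3.eu-west-1.amazonaws.com/path/to/data") gives "s3://bucket-name/path/to/data".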

I have noticed that download_custom is annotated as deprecated. How would I be able to stream a split zip file with just download and download_and_extract?

Which tool did you use to create these “split zip files”? It’s best to use a (CLI) tool such as zip that splits an archive into valid zip chunks. Then, these chunks can be passed to dl_manager.download_and_extract.
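
If you would rather build the chunks from Python than from the CLI, the same idea can be sketched with the standard zipfile module. This is only an illustration under a few assumptions: write_zip_chunks is a hypothetical helper (not part of datasets), the size budget is counted on uncompressed file sizes, and a single file larger than the budget simply gets a chunk of its own. The point is that every chunk is a complete zip archive that can be opened without the others.

```python
import zipfile
from pathlib import Path


def write_zip_chunks(files, out_dir, max_bytes):
    """Pack files into chunk_000.zip, chunk_001.zip, ... (hypothetical helper).

    Each chunk is an independent, valid zip archive. A new chunk is started
    when adding the next file would push the uncompressed total past max_bytes.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    current = None
    current_size = 0
    try:
        for file_path in map(Path, files):
            size = file_path.stat().st_size
            needs_new_chunk = current is None or (
                current_size > 0 and current_size + size > max_bytes
            )
            if needs_new_chunk:
                if current is not None:
                    current.close()
                chunk_path = out_dir / f"chunk_{len(chunks):03d}.zip"
                current = zipfile.ZipFile(chunk_path, "w", compression=zipfile.ZIP_DEFLATED)
                chunks.append(chunk_path)
                current_size = 0
            current.write(file_path, arcname=file_path.name)
            current_size += size
    finally:
        # Closing writes the central directory, which makes the archive valid.
        if current is not None:
            current.close()
    return chunks
```

Once uploaded, the resulting chunks can be passed to dl_manager.download_and_extract as described above.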

I used the split tool. I have since used the zipsplit tool as you mentioned. Thanks for the suggestion.