I have a file dataset (CSV or JSONL) hosted on a private S3 bucket, and I wrote a loading script hosted in my HF Dataset repository (the script is straightforward, but I can show it to you if necessary).
The thing is, my dataset is really heavy, so I want to use it in streaming mode:

```python
dataset = load_dataset("my_dataset", streaming=True)
```
So my problem is:
- A privately hosted dataset needs to be downloaded using `DownloadManager.download_custom()` with my URLs to the S3 bucket.
- When using a dataset in streaming mode, the `StreamingDownloadManager` class doesn't have such a `download_custom()` method, only a `download()` method.
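For reference, the non-streaming `download_custom` path could be sketched like this; `parse_s3_url`, `custom_download_func`, and the boto3 call are my assumptions for illustration, not code from this thread:

```python
def parse_s3_url(url: str):
    """Split "s3://bucket/key" into (bucket, key)."""
    assert url.startswith("s3://")
    bucket, _, key = url[len("s3://"):].partition("/")
    return bucket, key

def custom_download_func(src_url: str, dst_path: str) -> None:
    # Downloader passed to DownloadManager.download_custom: it receives
    # the source URL and the local destination path. boto3 is imported
    # lazily so the parsing helper above stays usable without it.
    import boto3
    bucket, key = parse_s3_url(src_url)
    boto3.client("s3").download_file(bucket, key, dst_path)
```

In the script this would be called as `dl_manager.download_custom(s3_url, custom_download_func)`, with boto3 picking up AWS credentials from the environment.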
I don't know if I'm trying to do something naughty here. If so, what do I need to do?
Or is this really a method you'd want to add to the class, but can't because of ?
Thanks in advance for your answers
Hi! We should probably implement `StreamingDownloadManager.download_custom`, considering we've got several requests for it.
In the meantime, you can use this pattern in your script:
```python
def _split_generators(self, dl_manager):
    if not dl_manager.is_streaming:
        data_path = dl_manager.download_custom(s3_url, custom_download_func)
    else:
        # convert s3_url to fsspec's format ("s3://path/to/bucket");
        # authentication explained here: https://s3fs.readthedocs.io/en/latest/#credentials
        data_path = convert_s3_url_to_fsspec_path(s3_url)
```
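Note that `convert_s3_url_to_fsspec_path` is not a `datasets` helper; here is a minimal sketch of what it could do, assuming the bucket is addressed with one of the two standard AWS HTTPS URL styles:

```python
from urllib.parse import urlparse

def convert_s3_url_to_fsspec_path(s3_url: str) -> str:
    """Convert an HTTPS S3 URL to fsspec's "s3://bucket/key" format.

    Handles virtual-hosted-style URLs
    (https://bucket.s3.amazonaws.com/key) and path-style URLs
    (https://s3.amazonaws.com/bucket/key).
    """
    parsed = urlparse(s3_url)
    host = parsed.netloc
    key = parsed.path.lstrip("/")
    if host.startswith("s3.") or host == "s3.amazonaws.com":
        # path-style: the first path segment is the bucket
        return f"s3://{key}"
    # virtual-hosted-style: the bucket is the first label of the host
    bucket = host.split(".")[0]
    return f"s3://{bucket}/{key}"
```

The resulting `s3://` path can then be opened with fsspec/s3fs, which handles authentication as described in the linked credentials documentation.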
Thank you for your answer; yes, a possible solution is to use s3fs for fsspec file loading.
I will follow your developments in case you implement a
download_custom() method for StreamingDownloadManager.
I guess an even better solution would be to support S3 directly in
Could you please share an example of how to convert the S3 URL to fsspec format? I looked at the document in the link, but it didn't have any example for objects stored in a private S3 bucket.
The S3 URL should look like `s3://path/to/bucket` (the fsspec format mentioned in the snippet above).