I have a file dataset (CSV or JSONL) hosted in a private S3 bucket, and I wrote a loading script hosted in my HF Dataset repository (the script is straightforward, but I can show it to you if necessary).
The thing is, my dataset is really large and I want to use it in streaming mode:
A privately hosted dataset needs to be downloaded using DownloadManager.download_custom() with my URLs to the S3 bucket.
When using the dataset in streaming mode, however, the StreamingDownloadManager class doesn’t have such a download_custom() method, only a download() method.
I don’t know if I’m trying to do something naughty. If so, what do I need to do instead?
Or,
is this really a method you want to add to the class but can’t because of … ?
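For context, here is a rough sketch of the non-streaming approach described above; the bucket name, the file key, and the boto3-based fetch helper are all hypothetical and only illustrate how download_custom() is typically wired up in a loading script:

```python
# Hedged sketch: _split_generators using download_custom() with a custom
# fetcher for a private S3 bucket. Bucket/key names below are made up.
import boto3
import datasets


class MyPrivateDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        s3 = boto3.client("s3")  # picks up credentials from env vars / ~/.aws

        def fetch_from_s3(src_url, dst_path):
            # src_url looks like "s3://my-private-bucket/data/train.jsonl"
            bucket, key = src_url[len("s3://"):].split("/", 1)
            s3.download_file(bucket, key, dst_path)

        local_path = dl_manager.download_custom(
            "s3://my-private-bucket/data/train.jsonl", fetch_from_s3
        )
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": local_path}
            )
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}
```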
Thank you for your answer; yes, a possible solution is to use s3fs for fsspec file loading.
I will follow your developments in case you implement a download_custom() method for StreamingDownloadManager.
Could you please share an example of how to convert the S3 URL to the fsspec format? I looked at the documentation in the link, but it didn’t have any example for objects stored in a private S3 bucket.
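For what it’s worth, here is a minimal sketch of what the fsspec-style access could look like for a private bucket; the bucket path is hypothetical, and the credentials are assumed to be available either via environment variables or passed explicitly to s3fs:

```python
# Hedged sketch: reading a private S3 object through s3fs/fsspec.
# The bucket, key, and credential placeholders below are hypothetical.
import s3fs

fs = s3fs.S3FileSystem(
    key="<AWS_ACCESS_KEY_ID>",        # or omit both to use env vars / ~/.aws config
    secret="<AWS_SECRET_ACCESS_KEY>",
)

# The fsspec-style path is simply the "s3://bucket/key" form of the object
with fs.open("s3://my-private-bucket/data/train.jsonl", "r") as f:
    print(f.readline())
```

Recent versions of datasets also expose a storage_options argument on load_dataset, so the same credentials can in principle be forwarded when loading s3:// data_files directly, though whether that works in streaming mode may depend on the library version.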
I have noticed that download_custom is annotated as deprecated. How would I be able to stream a split zip file with just download and download_and_extract?
Which tool did you use to create these “split zip files”? It’s best to use a (CLI) tool such as zip that splits an archive into valid zip chunks. Then, these chunks can be passed to dl_manager.download_and_extract.
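As a rough illustration of the loading-script side under that assumption (the chunk URLs and the JSONL field below are hypothetical), the chunks can be passed as a list to dl_manager.download_and_extract and then iterated with dl_manager.iter_files, which works with both the regular and the streaming download manager:

```python
# Hedged sketch: consuming independently valid zip chunks in a loading script.
# The URLs and the parsing logic below are placeholders.
import json
import datasets

_CHUNK_URLS = [
    "https://example.com/data/train-part-00.zip",
    "https://example.com/data/train-part-01.zip",
]


class MyChunkedDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        # Works with both DownloadManager and StreamingDownloadManager
        archives = dl_manager.download_and_extract(_CHUNK_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"files": dl_manager.iter_files(archives)},
            )
        ]

    def _generate_examples(self, files):
        key = 0
        for path in files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield key, {"text": json.loads(line)["text"]}
                    key += 1
```

The point of making each chunk a self-contained zip is that download_and_extract can then handle every chunk individually, including in streaming mode.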