Making datasets work with both streaming=True and streaming=False

augustoperes · October 25, 2023, 11:11am

I have a dataset with the following structure in a Google Cloud Storage bucket:

v0.1
    ├── 102042348
    │   ├── edges.npy
    │   ├── nodes.npy
    │   ├── wind_pressures.npy
    │   └── wind_velocities.npy
    ├── 102042349
    │   ├── edges.npy
    │   ├── nodes.npy
    │   ├── wind_pressures.npy
    │   └── wind_velocities.npy
    ├── 102042350
    │   ├── edges.npy
    │   ├── nodes.npy
    │   ├── wind_pressures.npy
    │   └── wind_velocities.npy

I am having trouble writing a _split_generators function that works with both streaming=True and streaming=False. At the moment I have this:

    def _split_generators(self, dl_manager):
        # Download and extract the zip file in the bucket.
        downloaded_dir = dl_manager.download_and_extract(self.bucket_url)

        dirs = [
            os.path.join(downloaded_dir, dir_)
            for dir_ in os.listdir(downloaded_dir)
        ]

        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN,
                                    gen_kwargs={'sim_dir_paths': dirs}),
        ]

This works with streaming=True but fails with streaming=False, raising the following error:

File "/home/augusto/Documents/api-data-generation/.env/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1011, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://storage.googleapis.com/download/storage/v1/b/*********%2F?alt=media')

Interestingly enough, if I zip the dataset and upload the archive to the bucket, the exact same function works with streaming=False but not with streaming=True.

So my question is: How do I get this to work with both streaming=False and streaming=True at the same time?
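
One direction that might avoid the difference is sketched below. It assumes the files can be enumerated as a flat list of paths in both modes, which I have not verified. The idea is to group the paths by their parent directory instead of calling os.listdir on the downloaded directory. The helper name group_by_simulation and the EXPECTED_FILES constant are hypothetical and not part of the datasets API.

```python
import posixpath
from collections import defaultdict

# The four arrays every simulation directory is expected to contain
# (taken from the bucket layout shown above).
EXPECTED_FILES = {
    "edges.npy",
    "nodes.npy",
    "wind_pressures.npy",
    "wind_velocities.npy",
}


def group_by_simulation(paths):
    """Group flat file paths by parent directory name (the simulation ID).

    Hypothetical helper: `paths` is any iterable of "/"-separated paths or
    URLs. Simulations missing one of the expected files are dropped.
    """
    groups = defaultdict(dict)
    for path in paths:
        sim_dir, filename = posixpath.split(path)
        if filename not in EXPECTED_FILES:
            continue  # ignore anything that is not one of the four arrays
        sim_id = posixpath.basename(sim_dir)
        groups[sim_id][filename] = path

    # Keep only complete simulations, in sorted order of simulation ID.
    return {
        sim_id: files
        for sim_id, files in sorted(groups.items())
        if set(files) == EXPECTED_FILES
    }
```

The resulting dict could be passed through gen_kwargs in place of sim_dir_paths. How to obtain the flat path list in each mode is still the open part of the question.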
