I have a dataset with the following structure in a Google Cloud bucket:
v0.1
├── 102042348
│   ├── edges.npy
│   ├── nodes.npy
│   ├── wind_pressures.npy
│   └── wind_velocities.npy
├── 102042349
│   ├── edges.npy
│   ├── nodes.npy
│   ├── wind_pressures.npy
│   └── wind_velocities.npy
├── 102042350
│   ├── edges.npy
│   ├── nodes.npy
│   ├── wind_pressures.npy
│   └── wind_velocities.npy
But I am having trouble creating a _split_generators function that works with both streaming=True and streaming=False. At the moment I have this:
def _split_generators(self, dl_manager):
    # Download and extract the zip file in the bucket.
    downloaded_dir = dl_manager.download_and_extract(self.bucket_url)
    dirs = [
        os.path.join(downloaded_dir, dir_)
        for dir_ in os.listdir(downloaded_dir)
    ]
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN,
                                gen_kwargs={'sim_dir_paths': dirs}),
    ]
This works with streaming=True but fails with streaming=False, raising the error:
File "/home/augusto/Documents/api-data-generation/.env/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1011, in raise_for_status
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://storage.googleapis.com/download/storage/v1/b/*********%2F?alt=media')
Interestingly enough, if I zip the dataset and upload it to the bucket, the exact same function works for streaming=False but not for streaming=True.
So my question is: how do I get this to work with both streaming=False and streaming=True at the same time?
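One direction that might be worth trying is to pass dl_manager.download an explicit mapping of per-file URLs instead of the bucket folder. The 404 above appears to come from requesting the folder itself as an object, which the bucket does not serve. The following is only a minimal sketch under assumptions: build_file_urls and download_sim_files are hypothetical helper names, the base URL is a placeholder, and the list of simulation IDs is assumed to come from somewhere known in advance (for example a manifest file), since a plain HTTP URL cannot be listed like a local directory.

```python
import os
from typing import Dict, List

# File names taken from the dataset layout shown in the question.
FILE_NAMES = ["edges.npy", "nodes.npy", "wind_pressures.npy", "wind_velocities.npy"]


def build_file_urls(base_url: str, sim_ids: List[str]) -> Dict[str, Dict[str, str]]:
    """Map each simulation ID to {file stem: file URL} (hypothetical helper)."""
    base = base_url.rstrip("/")
    return {
        sim_id: {
            os.path.splitext(name)[0]: f"{base}/{sim_id}/{name}"
            for name in FILE_NAMES
        }
        for sim_id in sim_ids
    }


def download_sim_files(dl_manager, base_url: str, sim_ids: List[str]):
    """Hypothetical helper to call from _split_generators.

    dl_manager.download accepts a nested dict of URLs and returns the same
    structure, so the result can be passed on through gen_kwargs.
    """
    return dl_manager.download(build_file_urls(base_url, sim_ids))


if __name__ == "__main__":
    # Placeholder base URL (assumption); the IDs are the ones listed above.
    urls = build_file_urls(
        "https://storage.googleapis.com/<bucket>/v0.1",
        ["102042348", "102042349", "102042350"],
    )
    for sim_id, files in urls.items():
        print(sim_id, files["edges"])
```

Inside _split_generators, the result of download_sim_files could replace the dirs list in gen_kwargs. The returned entries should be local cache paths with streaming=False and URLs with streaming=True, so _generate_examples would need to handle both; I have not verified this against the actual bucket.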