The s3://... paths are supported in other places in the library (like save_to_disk and download_and_prepare), but not yet for load_dataset.
If you want to create a dataset programmatically using s3fs or other tools, you can define a generator function in Python and pass it to Dataset.from_generator.
I’m also needing similar functionality. @lhoestq could you make a recommendation?
I have ~1,000 parquet files that were created with pyarrow and are saved in a nested structure in GCS (i.e. calling pyarrow.parquet.ParquetDataset(f"{name_of_bucket}/{name_of_root_dir_for_parquet_dataset}") automatically infers the relationship of all the constituent parquet files).
Constraints
Each parquet file is 0.5-1GB (so it is difficult to fit the entire dataset on a VM's hard disk, let alone in memory)
Need to perform preprocessing on the dataset as a whole
Ideas
a) Use Dataset.from_generator() and create a generator that does something like:
```python
# is it possible for this generator to benefit from streaming?
import pyarrow.dataset as ds

def gen():
    parquet_dataset = ds.dataset(uri_dir, filesystem=gcs_fs, format="parquet")
    for fragment in parquet_dataset.get_fragments():  # iterates over constituent parquet files
        fragment_table = fragment.to_table()  # this is slow as parquet files are large
        data = fragment_table.to_pydict()
        for idx in range(len(data["x"])):
            yield {"x": data["x"][idx]}  # from_generator expects dict examples

dataset = Dataset.from_generator(gen)  # does this fully enumerate the generator in order to return a dataset object?
dataset = dataset.map(...)
dataset.save_to_disk(...)  # based on my understanding I'll then be able to load from this save path without having to construct the generator in the future
```
b) Write a custom loading script that loads each parquet file as a pyarrow table, passes it directly to the Dataset constructor, and concatenates all the resulting datasets.
c) Does it make sense to use dask here? If so, could you point to an implementation a bit more thorough than the one in the ‘cloud storage’ tab of the docs?
I also have control over this pipeline upstream, so I could change it to make loading and preprocessing the dataset easier. Does that sound like a better path? If so, how?
I've been stuck on this for some time; I can't express how much any help would be appreciated!
This feature is much desired. I have a ton of files in s3 which I want to stream without downloading. I will try contributing to the feature. Thanks!