How to use S3 path with `load_dataset` with streaming=True?

When loading local files, we can try:

train_data = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv", 
                  streaming=True, delimiter="\t", split="train")

There’s a load_from_disk function for cloud storage, but I can’t find any documentation on using load_dataset with an fs=s3fs.S3FileSystem(...) option.

Is there a way to stream the file directly from S3 or other cloud storage, doing something like this with streaming=True?

import s3fs
fs = s3fs.S3FileSystem(...)

train_data = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv", 
                  streaming=True, delimiter="\t", split="train", fs=fs)

Maybe a related feature request/question:

Can load_dataset support _io.TextIOWrapper types instead of file paths?

Currently, a file path is supported:

infile = "../input/tatoeba/tatoeba-sentpairs.tsv" 
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")

but sometimes it would be very useful to do something like:

infile = open("../input/tatoeba/tatoeba-sentpairs.tsv")  # -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")

If we can do the latter, then maybe we can pipe/stream through s3 like this:

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

infile = s3.open("../input/tatoeba/tatoeba-sentpairs.tsv") # -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")

Ideally load_dataset could support s3://... paths. Cc @delip this is similar to what we discussed today for gs://... paths

Thanks for the note on s3://... or gs://....

Just to clarify my understanding: are cloud storage paths already supported, or is this a feature suggestion?

If it’s already supported, what would the usage look like? Does the syntax look like either of these:

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

train_data = load_dataset("csv", data_files="s3://tatoeba/tatoeba-sentpairs.tsv", 
                  streaming=True, delimiter="\t", split="train", fs=s3)

or

from datasets.filesystems import S3FileSystem
s3 = S3FileSystem()

train_data = load_dataset("csv", data_files=s3.open("tatoeba/tatoeba-sentpairs.tsv"), 
                  streaming=True, delimiter="\t", split="train")

or is it something else?

The s3://... paths are supported in other places in the library (like save_to_disk and download_and_prepare), but not yet for load_dataset.

If you want to create a dataset programmatically using s3fs or other tools, you can define a generator function in Python and pass it to Dataset.from_generator.

Thanks for the clarification!

I’ll try using from_generator together with one of the modes from the AWS Machine Learning Blog post “Choose the best data source for your Amazon SageMaker training job” and see whether it works.

Is there some pointer to using the Dataset.from_generator() function?

When I tried .from_generator(), e.g.

import pandas as pd
from datasets import Dataset
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

with s3.open("s3://mydata/data.tsv") as fin:
    df = pd.read_csv(fin, sep='\t', chunksize=50)  # df is iterable. 

ds = Dataset.from_generator(df)

it was throwing an error:

AttributeError: 'S3File' object has no attribute 'name'

From a pandas dataframe you can do

ds = Dataset.from_pandas(df)

And from_generator actually takes a generator function as input, e.g.

def gen():
    s3 = S3FileSystem()

    with s3.open("s3://mydata/data.tsv") as fin:
        data = ...
        for example in data:
            yield example
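
For illustration, here is a minimal sketch of such a generator function, assuming a tab-separated file with a header row. The names tsv_rows and open_s3_text are hypothetical, and the stdlib csv module is only one possible way to fill in the parsing step left open above. The path s3://mydata/data.tsv is the one from the earlier post.

```python
import csv
import io
from typing import Callable, Dict, Iterator, TextIO


def tsv_rows(open_text: Callable[[], TextIO]) -> Iterator[Dict[str, str]]:
    """Yield one dict per TSV row; rows are read lazily while the file is open."""
    with open_text() as fin:
        # DictReader takes the column names from the header line.
        for row in csv.DictReader(fin, delimiter="\t"):
            yield dict(row)


def open_s3_text() -> TextIO:
    """Hypothetical opener: returns a text-mode handle to an object on S3."""
    import s3fs  # imported lazily so the parsing logic can be tried without S3 access

    return s3fs.S3FileSystem().open("s3://mydata/data.tsv", "r")


# Local check with an in-memory file instead of S3.
sample = "en\tde\nhello\thallo\n"
print(list(tsv_rows(lambda: io.StringIO(sample))))
```

With the datasets library installed, this should be usable as Dataset.from_generator(tsv_rows, gen_kwargs={"open_text": open_s3_text}). Note that from_generator receives the function itself, not an already-created iterator, which is why passing the chunked DataFrame reader directly did not work.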

I also need similar functionality. @lhoestq could you make a recommendation?

I have ~1,000 parquet files that were created with pyarrow and are saved as a nested structure in GCS (i.e. calling pyarrow.parquet.ParquetDataset(f"{name_of_bucket}/{name_of_root_dir_for_parquet_dataset}") automatically infers the relationship of all sub parquet files).

Constraints

  • Each parquet file is 0.5-1GB (so it is difficult to fit the entire dataset on a VM’s hard disk, let alone in memory)
  • Need to perform preprocessing on the dataset as a whole

Ideas
a) Use Dataset.from_generator() and create a generator that does something like

import pyarrow.parquet as pq
from datasets import Dataset

# is it possible for this generator to benefit from streaming?
def gen():
    # uri_dir and gcs_fs are assumed to be defined elsewhere
    parquet_dataset = pq.ParquetDataset(uri_dir, filesystem=gcs_fs)
    for fragment in parquet_dataset.fragments:  # iterates over constituent parquet files
        fragment_table = fragment.to_table()  # this is slow as parquet files are large
        data = fragment_table.to_pydict()
        for idx in range(len(data['x'])):
            yield {'x': data['x'][idx]}  # from_generator expects each example to be a dict

dataset = Dataset.from_generator(gen)  # does this fully enumerate the generator in order to return a dataset object?
dataset = dataset.map(...)
dataset.save_to_disk(...)  # based on my understanding I'll now be able to load from this save path without having to construct the generator in the future

b) Write a custom loading script that loads each parquet file as a pyarrow table, passes it directly to the Dataset constructor, and concatenates all the resulting datasets.
c) Does it make sense to use dask here? If so, could you point to an implementation a bit more thorough than the one in the ‘cloud storage’ tab in the docs?

I also have control over the upstream pipeline and could change it to make loading and preprocessing the dataset easier. Does that sound like a better path? If so, how?

I’ve been stuck on this for some time; I can’t express enough how much the help is appreciated!

a) sounds good for now: it will iterate over the generator and return a Dataset.
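
If loading a whole fragment with to_table() is too slow or memory-hungry, one option is to read record batches instead. This is only a rough sketch, assuming pyarrow.dataset can open the directory with the given filesystem. The helper rows_from_batches, the parameters of gen, and the batch size are illustrative choices, not part of the datasets API.

```python
from typing import Any, Dict, Iterable, Iterator


def rows_from_batches(batches: Iterable[Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per row from objects exposing to_pylist(), e.g. pyarrow RecordBatch."""
    for batch in batches:
        # Only the current batch is held in memory, not a whole parquet file.
        for row in batch.to_pylist():
            yield row


def gen(uri_dir: str, filesystem: Any = None, batch_size: int = 10_000) -> Iterator[Dict[str, Any]]:
    """Stream rows from a directory of parquet files, one record batch at a time."""
    import pyarrow.dataset as pads  # imported lazily; requires pyarrow

    dataset = pads.dataset(uri_dir, format="parquet", filesystem=filesystem)
    yield from rows_from_batches(dataset.to_batches(batch_size=batch_size))
```

The generator can then be handed to Dataset.from_generator as in idea a), with uri_dir and gcs_fs supplied through gen_kwargs. Keep in mind that from_generator still writes the full dataset to the local cache, so disk space remains a constraint.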

Though in the future we’ll probably support gs:// URLs and you’ll be able to do

load_dataset(f"gs://{name_of_bucket}/{name_of_root_dir_for_parquet_dataset}")

Stay tuned!

This feature is very much desired. I have a ton of files in S3 that I want to stream without downloading. I will try contributing to the feature. Thanks!

We’re starting to explore this. I created an issue to follow progress: Support cloud storage in load_dataset · Issue #5281 · huggingface/datasets · GitHub
