How to use S3 path with `load_dataset` with streaming=True?

When loading local files, we can try:

from datasets import load_dataset

train_data = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv",
                  streaming=True, delimiter="\t", split="train")

There’s a load_from_disk function for cloud storage, but I can’t find any documentation on load_dataset with an fs=s3fs.S3FileSystem(...) option.

Is there a way to retrieve the file from S3 (or other cloud storage) and still stream it, i.e. do something like this with streaming=True?

import s3fs
fs = s3fs.S3FileSystem(...)

train_data = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv", 
                  streaming=True, delimiter="\t", split="train", fs=fs)

Maybe a related feature request/question:

Can load_dataset support _io.TextIOWrapper types instead of file paths?

Currently, a file path is supported:

infile = "../input/tatoeba/tatoeba-sentpairs.tsv" 
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")

but sometimes it would be very useful to be able to do something like:

infile = open("../input/tatoeba/tatoeba-sentpairs.tsv")  # -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")

If we can do the latter, then maybe we can pipe/stream through S3 like this:

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

infile = s3.open("../input/tatoeba/tatoeba-sentpairs.tsv") # -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")

Ideally load_dataset could support s3://... paths. Cc @delip, this is similar to what we discussed today for gs://... paths.


Thanks for the note on s3://... or gs://....

Just to clarify my understanding: are cloud storage paths already supported, or is this a feature suggestion?

If it’s already supported, what would the usage look like? Does the syntax look like any of these:

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

train_data = load_dataset("csv", data_files="s3://tatoeba/tatoeba-sentpairs.tsv", 
                  streaming=True, delimiter="\t", split="train", fs=s3)

or

from datasets.filesystems import S3FileSystem
s3 = S3FileSystem()

train_data = load_dataset("csv", data_files=s3.open("tatoeba/tatoeba-sentpairs.tsv"), 
                  streaming=True, delimiter="\t", split="train")

or is it something else?

The s3://... paths are supported in other places in the library (like save_to_disk and download_and_prepare), but not yet for load_dataset.
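
For example, a minimal sketch of that save_to_disk / load_from_disk round trip through S3, assuming a recent datasets version that accepts storage_options and using a hypothetical bucket name and credentials:

from datasets import load_dataset, load_from_disk

# hypothetical bucket and credentials; storage_options is forwarded to s3fs
storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}

ds = load_dataset("csv", data_files="tatoeba-sentpairs.tsv", delimiter="\t", split="train")
ds.save_to_disk("s3://my-bucket/tatoeba", storage_options=storage_options)

# later, read it back directly from the bucket
ds = load_from_disk("s3://my-bucket/tatoeba", storage_options=storage_options)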

If you want to create a dataset programmatically using s3fs or other tools, you can define a generator function in Python and give it to Dataset.from_generator.
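
For example, a rough sketch of that pattern for the TSV case above (the bucket path is a placeholder, and this assumes s3fs is installed and AWS credentials are configured):

import csv

import s3fs
from datasets import Dataset

def gen():
    s3 = s3fs.S3FileSystem()
    # placeholder bucket path
    with s3.open("s3://my-bucket/tatoeba-sentpairs.tsv", "r") as fin:
        for row in csv.DictReader(fin, delimiter="\t"):
            yield row  # one dict per line, keyed by column name

ds = Dataset.from_generator(gen)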


Thanks for the clarification!

I’ll try using from_generator together with one of the modes from Choose the best data source for your Amazon SageMaker training job | AWS Machine Learning Blog and see whether it works.

Is there some pointer to using the Dataset.from_generator() function?

When trying .from_generator(), e.g.

import pandas as pd
from datasets import Dataset
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

with s3.open("s3://mydata/data.tsv") as fin:
    df = pd.read_csv(fin, sep='\t', chunksize=50)  # df is iterable. 

ds = Dataset.from_generator(df)

it was throwing an error:

AttributeError: 'S3File' object has no attribute 'name'

From a pandas DataFrame you can do

ds = Dataset.from_pandas(df)
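
Note that with chunksize=..., pd.read_csv returns an iterator of chunks rather than a DataFrame, so for from_pandas you would read the file in one go (or concatenate the chunks) first, e.g. with the same placeholder path:

import pandas as pd
from datasets import Dataset
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

with s3.open("s3://mydata/data.tsv") as fin:
    df = pd.read_csv(fin, sep="\t")  # no chunksize, so df is a real DataFrame

ds = Dataset.from_pandas(df)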

And from_generator actually takes a generator function as input, e.g.

from datasets.filesystems import S3FileSystem

def gen():
    s3 = S3FileSystem()
    with s3.open("s3://mydata/data.tsv") as fin:
        data = ...
        for example in data:
            yield example
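
You then pass the function itself (not a generator object) to from_generator:

ds = Dataset.from_generator(gen)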

I also need similar functionality. @lhoestq could you make a recommendation?

I have ~1,000 parquet files that were created with pyarrow and are saved as a nested structure in GCS (i.e. calling pyarrow.parquet.ParquetDataset(f"{name_of_bucket}/{name_of_root_dir_for_parquet_dataset}") automatically infers the relationship of all the constituent parquet files).

Constraints

  • Each parquet file is 0.5-1 GB (accordingly, it’s difficult to fit the entire dataset on a VM’s hard disk, let alone in memory)
  • Need to perform preprocessing on the dataset as a whole

Ideas
a) Use Dataset.from_generator() and create a generator that does something like

import pyarrow.dataset as pds
from datasets import Dataset

# is it possible for this generator to benefit from streaming?
def gen():
    # uri_dir and gcs_fs are placeholders for the GCS path and a gcsfs filesystem
    parquet_dataset = pds.dataset(uri_dir, filesystem=gcs_fs, format="parquet")
    for fragment in parquet_dataset.get_fragments():  # iterates over constituent parquet files
        fragment_table = fragment.to_table()  # this is slow as parquet files are large
        data = fragment_table.to_pydict()
        for idx in range(len(data['x'])):
            yield {'x': data['x'][idx]}  # from_generator expects dict examples

dataset = Dataset.from_generator(gen)  # does this fully enumerate the generator in order to return a dataset object?
dataset = dataset.map(...)
dataset.save_to_disk(...)  # based on my understanding I'll now be able to load from this save path without having to construct the generator in the future

b) Write a custom loading script that loads each parquet file as a pyarrow table, passes it directly to the Dataset constructor, and concatenates all the resulting datasets (a rough sketch follows after this list).
c) Does it make sense to use dask here? If so, could you point to an implementation a bit more thorough than the one in the ‘cloud storage’ tab of the docs?
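
For idea (b), a rough sketch of what that could look like, purely for illustration (uri_dir and gcs_fs are the same placeholders as in the snippet above):

import pyarrow.dataset as pds
from datasets import Dataset, concatenate_datasets

# build one in-memory Dataset per parquet fragment, then concatenate them
parquet_dataset = pds.dataset(uri_dir, filesystem=gcs_fs, format="parquet")
shards = [Dataset(fragment.to_table()) for fragment in parquet_dataset.get_fragments()]
dataset = concatenate_datasets(shards)

Note that this keeps every fragment’s table in memory at once, so with 0.5-1 GB files the generator approach in (a) is likely the more practical route.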

I also have control over this pipeline upstream, so I could change it to make loading and preprocessing the dataset easier. Does that sound like a better path? If so, how?

I’ve been stuck on this for some time; I can’t express how much the help is appreciated!

a) sounds good for now: it will iterate over the generator and return a Dataset.

Though in the future we’ll probably support gs:// URLs and you’ll be able to do

load_dataset(f"gs://{name_of_bucket}/{name_of_root_dir_for_parquet_dataset}")

Stay tuned!


This feature is very much desired. I have a ton of files in S3 that I want to stream without downloading. I will try contributing to the feature. Thanks!

We’re starting to explore this, I created an issue here to follow the advancements: Support cloud storage in load_dataset · Issue #5281 · huggingface/datasets · GitHub
