How to use S3 path with `load_dataset` with streaming=True?

alvations · November 9, 2022, 10:18am

Maybe a related feature request/question:

Can `load_dataset` support `_io.TextIOWrapper` types instead of file paths?

Currently, a file path is supported:

from datasets import load_dataset

infile = "../input/tatoeba/tatoeba-sentpairs.tsv"
train_data = load_dataset("csv", data_files=infile,
                          streaming=True, delimiter="\t", split="train")

But sometimes it would be very useful if we could do something like:

infile = open("../input/tatoeba/tatoeba-sentpairs.tsv")  # -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")

If the latter were possible, then maybe we could stream from S3 like this:

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

infile = s3.open("../input/tatoeba/tatoeba-sentpairs.tsv", "r")  # text mode -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile, 
                  streaming=True, delimiter="\t", split="train")
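
In the meantime, a possible workaround is to read rows lazily from an already-open text handle myself. This is only a minimal sketch using the standard library: `rows_from_handle` is a hypothetical helper name, the `src`/`tgt` columns are made up, and the `io.StringIO` object stands in for whatever `open(...)` or `s3.open(..., "r")` would return.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def rows_from_handle(handle: TextIO, delimiter: str = "\t") -> Iterator[Dict[str, str]]:
    """Yield one dict per row from an open text handle, without loading it all."""
    # DictReader takes the column names from the first line of the handle.
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        yield row


# Stand-in for a real file or S3 handle opened in text mode (made-up data).
sample = io.StringIO("src\ttgt\nhello\thallo\nthanks\tdanke\n")
rows = list(rows_from_handle(sample))
print(rows[0])
```

If the installed `datasets` version provides `IterableDataset.from_generator`, a generator like this could probably be plugged into it to get a streaming dataset. I would guess it is safer to pass the path through `gen_kwargs` and open the file inside the generator rather than passing the handle itself, but I have not verified that against S3.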
