How to use S3 path with `load_dataset` with streaming=True?

alvations · November 16, 2022, 2:50pm

Are there any pointers on using the Dataset.from_generator() function?

When I call .from_generator(), e.g.

import pandas as pd
from datasets import Dataset
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

with s3.open("s3://mydata/data.tsv") as fin:
    df = pd.read_csv(fin, sep='\t', chunksize=50)  # df is iterable. 

ds = Dataset.from_generator(df)

it throws this error:

AttributeError: 'S3File' object has no attribute 'name'
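
For reference, `Dataset.from_generator` is documented as taking a generator function (a callable that yields example dicts), not an already-built iterable such as the chunked reader above. That reader is also consumed lazily, after the `with` block has closed the S3 file. Whether either point is the direct cause of the `AttributeError` is not confirmed here. Below is a minimal sketch of the callable form. It is a sketch under assumptions, not a confirmed fix. It swaps pandas for the standard-library `csv` module to stay small. The helper names `tsv_rows` and `build_dataset` and the `opener` parameter are hypothetical. It assumes `fsspec` with the `s3fs` backend is installed and that S3 credentials are already configured.

```python
import csv


def tsv_rows(path, opener=None):
    """Yield one dict per TSV row, opening the file inside the generator."""
    if opener is None:
        # Assumption: fsspec (with s3fs installed) resolves s3:// URLs.
        import fsspec
        opener = fsspec.open
    # The file is opened here, so it stays open while rows are being read.
    with opener(path, "rt") as fin:
        for row in csv.DictReader(fin, delimiter="\t"):
            yield dict(row)  # all values are strings; cast as needed


def build_dataset(path="s3://mydata/data.tsv"):
    """Hypothetical wrapper: pass the function itself plus its arguments."""
    from datasets import Dataset
    return Dataset.from_generator(tsv_rows, gen_kwargs={"path": path})
```

Calling `build_dataset()` materializes the rows into a regular `Dataset`. For lazy iteration closer to `streaming=True`, recent `datasets` releases also provide `IterableDataset.from_generator`, which accepts the same kind of generator function.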
