Hi! I have a bunch of parquet files with the same schema in an S3 bucket. Is there a way to load them as a dataset using load_dataset?
I'm doing this, but it's not working:
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U datasets
!pip install s3fs

import s3fs
from datasets import load_dataset

# Credentials passed through to the underlying s3fs/fsspec filesystem
STORAGE_OPTIONS = {
    "key": "...",
    "secret": "...",
}
S3_BUCKET = "s3://..."  # note: needs a trailing "/" so the glob below points inside the bucket

fs = s3fs.S3FileSystem(**STORAGE_OPTIONS)
data = load_dataset("parquet", data_files=f"{S3_BUCKET}*.parquet", storage_options=STORAGE_OPTIONS)
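As a sanity check (a minimal sketch reusing the fs and S3_BUCKET defined above), I can at least confirm the credentials and glob pattern resolve to files:

# List the parquet files the glob should match; an empty list would mean
# the path or credentials are the problem rather than load_dataset itself.
print(fs.glob(f"{S3_BUCKET}*.parquet"))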
Support for fsspec filesystems is still experimental. A quick test on a public S3 bucket shows there are still some issues to address to support S3 paths. I'm fixing them in Fix `fsspec` download by mariosasko · Pull Request #6085 · huggingface/datasets · GitHub (should be merged tomorrow). Then, it will be possible to download these files by installing datasets from main.
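Concretely, installing from main would look something like this (a standard pip install from the GitHub repo, shown as a sketch assuming the default main branch):

!pip install -q -U git+https://github.com/huggingface/datasets.git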
Awesome @mariosasko, I'll be waiting for that then. Do you think my code setup is correct? Also, in case you have time, is there any other approach you think would work well in my scenario?