Help creating dataset from s3 bucket with parquet files

Hi! I have a bunch of parquet files with the same schema in an S3 bucket. Is there a way to load them as a dataset using load_dataset?

I’m doing this, but it’s not working:

!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U datasets
!pip install s3fs

import s3fs
from datasets import load_dataset

STORAGE_OPTIONS = {
    "key": "...",
    "secret": "...,
}
S3_BUCKET="s3://..."
fs = s3fs.S3FileSystem(**STORAGE_OPTIONS)

data = load_dataset("parquet", data_files=f"{S3_BUCKET}*.parquet", storage_options=STORAGE_OPTIONS)
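For what it’s worth, here is the quick sanity check I run first, against the same placeholders (my assumption: listing the files through the s3fs filesystem should confirm the key/secret and bucket path before involving load_dataset):

# list the parquet files directly through s3fs; an empty list would suggest
# the credentials or the bucket/prefix are wrong
print(fs.glob(f"{S3_BUCKET}*.parquet"))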

Support for fsspec filesystems is still experimental.

A quick test on a public S3 bucket shows there are still some issues to address to support S3 paths. I’m fixing them in Fix `fsspec` download by mariosasko · Pull Request #6085 · huggingface/datasets · GitHub (should be merged tomorrow). Then, it will be possible to download these files by installing datasets from main.
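Roughly, once the fix is on main, the setup should look like this (just a sketch, reusing the same placeholder credentials and bucket from your snippet):

# install datasets from source so the fsspec/S3 fix is included
# !pip install -q -U git+https://github.com/huggingface/datasets.git

from datasets import load_dataset

STORAGE_OPTIONS = {"key": "...", "secret": "..."}  # same AWS credential placeholders
S3_BUCKET = "s3://..."

data = load_dataset(
    "parquet",
    data_files=f"{S3_BUCKET}*.parquet",
    storage_options=STORAGE_OPTIONS,
)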


Awesome @mariosasko, I’ll be waiting for that then. Do you think my code setup is correct? Also, in case you have time, is there any other approach you think would work well in my scenario?
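In the meantime, one fallback I’m considering is to read the parquet files directly through s3fs with pandas and build the Dataset in memory (a rough sketch, reusing fs, STORAGE_OPTIONS and S3_BUCKET from my snippet above; not sure how well this scales to larger data):

import pandas as pd
from datasets import Dataset

# fs.glob returns bucket/key paths without the "s3://" scheme, so add it back
files = fs.glob(f"{S3_BUCKET}*.parquet")
frames = [
    pd.read_parquet(f"s3://{path}", storage_options=STORAGE_OPTIONS)
    for path in files
]
data = Dataset.from_pandas(pd.concat(frames, ignore_index=True))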