Hi! I have a bunch of parquet files with the same schema in an S3 bucket. Is there a way to load them as a dataset using load_dataset?
I'm doing this, but it's not working:
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U datasets
!pip install s3fs

import s3fs
from datasets import load_dataset

# Credentials passed through to the underlying s3fs/fsspec filesystem
STORAGE_OPTIONS = {
    "key": "...",
    "secret": "...",
}
S3_BUCKET = "s3://..."  # note: needs a trailing "/" so the glob below points inside the bucket

fs = s3fs.S3FileSystem(**STORAGE_OPTIONS)
data = load_dataset("parquet", data_files=f"{S3_BUCKET}*.parquet", storage_options=STORAGE_OPTIONS)
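As a sanity check (a minimal sketch reusing the fs and S3_BUCKET defined above), I can at least confirm the credentials and glob pattern resolve to files:

# List the parquet files the glob should match; an empty list would mean
# the path or credentials are the problem rather than load_dataset itself.
print(fs.glob(f"{S3_BUCKET}*.parquet"))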
Support for fsspec filesystems is still experimental. A quick test on a public S3 bucket shows there are still some issues to address to support S3 paths. I'm fixing them in Fix `fsspec` download by mariosasko · Pull Request #6085 · huggingface/datasets · GitHub (should be merged tomorrow). Then, it will be possible to download these files by installing datasets from main.
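Concretely, installing from main would look something like this (a standard pip install from the GitHub repo, shown as a sketch assuming the default main branch):

!pip install -q -U git+https://github.com/huggingface/datasets.git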
Awesome @mariosasko, I'll be waiting for that then. Do you think my code setup is correct? Also, in case you have time, is there any other approach you think would work well in my scenario?