How can I convert a loaded dataset into a Parquet file and save it to S3?

Currently, I can use dataset.save_to_disk("s3://…") to save directly to an S3 bucket as Arrow files. But how do I save it as a Parquet file?

The to_parquet method fails to save directly to the S3 bucket.

Currently, the only option is to save the files locally and then upload them to an S3 bucket.

I opened an issue, as this would be useful to support: Support `fsspec` in `Dataset.to_<format>` methods (Issue #6086, huggingface/datasets).


What I do is use to_parquet to write the file locally and then upload it with boto3.
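
In case it helps, here is a minimal sketch of that save-locally-then-upload workaround. The helper name `save_parquet_to_s3`, its parameters, and the temporary file name `data.parquet` are my own for illustration, not part of `datasets` or `boto3`. It assumes `dataset` is an object with a `to_parquet(path)` method (such as a `datasets.Dataset`) and that AWS credentials are already configured for boto3.

```python
import os
import tempfile


def save_parquet_to_s3(dataset, bucket, key, s3_client=None):
    """Write `dataset` to a temporary Parquet file, then upload it to S3.

    Hypothetical helper: `bucket` and `key` are the target S3 bucket name
    and object key. Returns the resulting s3:// URI as a string.
    """
    if s3_client is None:
        # Imported lazily so the helper can also be used with a stub client.
        import boto3

        s3_client = boto3.client("s3")

    # The temporary directory (and the local file) is removed on exit.
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "data.parquet")
        # Step 1: write the dataset to a local Parquet file.
        dataset.to_parquet(local_path)
        # Step 2: upload the local file with the boto3 S3 client.
        s3_client.upload_file(local_path, bucket, key)

    return f"s3://{bucket}/{key}"
```

With a real dataset you would call it as something like `save_parquet_to_s3(dataset, "my-bucket", "data/train.parquet")`, where the bucket and key are placeholders. Note that this writes the whole dataset to local disk first, so you need enough free space for the Parquet file.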