Best practice for saving large datasets to a cloud storage

I wanted to combine multiple large datasets and do some processing on them; for example, one of them is the OSCAR dataset. I thought of loading each dataset on a VM (for example, an EC2 instance) and then saving every record to cloud storage such as an S3 bucket. But for multiple large datasets (~200 GB of text), this procedure costs both money and time.
I wanted to know if there is a better way to do this.

After some searching and testing, I found a solution that works well, so I'm sharing it here:

How to download/transfer HF dataset to S3 bucket


from botocore.session import Session
import s3fs
from datasets import load_dataset_builder

One option is to pass your credentials (AWS Access Key ID and AWS Secret Access Key) directly:

storage_options = {"key": "XXX", "secret": "XXX"}
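Hardcoded keys are easy to leak if the script is shared or committed. A minimal sketch that reads the same two values from the standard AWS environment variables instead:

```python
import os

# Read the credentials from the standard AWS environment variables
# rather than writing them into the script itself.
storage_options = {
    "key": os.environ.get("AWS_ACCESS_KEY_ID"),
    "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
}
```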

Alternatively, instead of hardcoding credentials, instantiate a boto session from a named profile and pass that:

s3_session = Session(profile="your_profile_name")

storage_options = {"session": s3_session}

Heads up!
To create the profile, run this command first:

aws configure --profile "your_profile_name"

After that, create the file system:

fs = s3fs.S3FileSystem(**storage_options)
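Before kicking off a ~200 GB transfer, it can help to confirm the credentials actually reach the bucket. A small sanity-check sketch (the function name and arguments are mine, not part of the original recipe; `s3fs` must be installed):

```python
def check_bucket_access(bucket_uri, storage_options):
    """List the target bucket once, so credential or permission
    problems surface before the large transfer starts."""
    import s3fs  # imported inside so the sketch loads even without s3fs
    fs = s3fs.S3FileSystem(**storage_options)
    return fs.ls(bucket_uri)
```

If this raises a permission error, fix the credentials or bucket policy before running the full transfer.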

Then download your dataset straight into the S3 bucket:

builder = load_dataset_builder("your dataset id")

output_dir = "s3://path/to/the/bucket/"

builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
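Once the parquet shards are in the bucket, they can be reloaded directly from S3 instead of re-downloading the original dataset. A hedged sketch using the generic `parquet` loader of `load_dataset` (the helper name and glob pattern are my assumptions; `datasets` must be installed):

```python
def load_from_bucket(bucket_uri, storage_options):
    """Reload the prepared parquet files straight from S3.
    The glob is assumed to match the files written by
    download_and_prepare under the given output directory."""
    from datasets import load_dataset  # deferred so the sketch loads without datasets
    return load_dataset(
        "parquet",
        data_files=bucket_uri.rstrip("/") + "/**/*.parquet",
        storage_options=storage_options,
    )
```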

HF Cloud storage
