Best practice for saving large datasets to a cloud storage

I wanted to combine multiple large datasets and do some processing on them. For example, one of them is the OSCAR dataset. I thought of loading each dataset on a VM (for example, EC2) and then saving each record to cloud storage like an S3 bucket. But for multiple large datasets (~200 GB of text), this procedure costs both money and time.
I wanted to know if there is a better way to do this.

So after some searching and testing, I found a solution that looks good, so I'm sharing it here:

How to download/transfer a HF dataset to an S3 bucket


from botocore.session import Session
import s3fs
from datasets import load_dataset_builder

First, enter your credentials (AWS Access Key ID and AWS Secret Access Key):

storage_options = {"key": "XXX", "secret": "XXX"}

Alternatively, instantiate a boto session and pass that instead of the raw keys:

s3_session = Session(profile="your_profile_name")

storage_options = {"session": s3_session}

Heads up!
To create the profile, run this command first:

aws configure --profile "your_profile_name"

After that, create the file system:

fs = s3fs.S3FileSystem(**storage_options)

and then download your dataset into the S3 bucket:

builder = load_dataset_builder("your dataset id")

output_dir = "s3://path/to/the/bucket/"

builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")

HF Cloud storage


Hi! I’m trying to run this code on Datalore (a Jupyter notebook service like Colab), and I have an issue with the AWS profile:

/usr/bin/sh: 1: aws: not found

Do you know how to fix it? Maybe I should set up my AWS profile or bucket differently?

@chudotony Did you install awscli in your environment?



I tried that after your comment, but it didn’t help.

What’s the best practice for GCS? I tried downloading allenai/c4/en with builder.download_and_prepare(); it is very slow and gets stuck. I’ve tried several times.
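The same `download_and_prepare` pattern should work for GCS via fsspec/gcsfs: pass a `gcs://` output path and GCS credentials in `storage_options` (for gcsfs, a `token` entry pointing at a service-account key file). An untested sketch, assuming gcsfs is installed; the bucket path and key-file path are hypothetical:

```python
from datasets import load_dataset_builder

# gcsfs accepts a "token" pointing at a service-account key file
# (hypothetical path -- replace with your own).
storage_options = {"token": "path/to/service-account-key.json"}

builder = load_dataset_builder("allenai/c4", "en")
builder.download_and_prepare(
    "gcs://your-bucket/c4-en/",  # hypothetical bucket path
    storage_options=storage_options,
    file_format="parquet",
)
```

Note that c4's `en` config is hundreds of GB of compressed text, so a long prepare step is expected; running this on a VM in the same GCP region as the bucket should help with throughput.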