Best practice for saving large datasets to cloud storage

Hi,
I wanted to combine multiple large datasets and do some processing on them. For example, one of them is the OSCAR dataset. I thought of loading each dataset on a VM (for example, EC2) and then saving each record to cloud storage such as an S3 bucket. But for multiple large datasets (~200 GB of text), this procedure would cost both money and time.
I wanted to know if there is a better way to do this.
Thanks

After some searching and testing, I found a solution that looks good, so I'm sharing it here:

How to download/transfer an HF dataset to an S3 bucket

Imports:

import aiobotocore.session
import s3fs
from datasets import load_dataset_builder

First, set up your credentials. You can pass your AWS Access Key ID and AWS Secret Access Key directly:

storage_options = {"key": "XXX", "secret": "XXX"}

Alternatively, instantiate an aiobotocore session from a named AWS profile and pass that instead:

s3_session = aiobotocore.session.AioSession(profile="your_profile_name")

storage_options = {"session": s3_session}

Heads up!
To create the profile, run this command:

aws configure --profile "your_profile_name"
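For reference, this stores the keys under that profile name in ~/.aws/credentials, roughly like this (the values are placeholders):

[your_profile_name]
aws_access_key_id = XXX
aws_secret_access_key = XXX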

After that, create the file system:

fs = s3fs.S3FileSystem(**storage_options)
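A quick way to check that the credentials work is to list the target bucket (the bucket name below is a placeholder):

fs.ls("s3://your-bucket-name")  # should return the bucket contents instead of raising an access error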

and then download your dataset into an S3 bucket:

builder = load_dataset_builder("your dataset id")

output_dir = "s3://path/to/the/bucket/"

builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
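To confirm the export finished, you can list what was written to the output directory (this reuses the fs and output_dir defined above):

files = fs.find(output_dir)  # recursively lists the Parquet shards written by download_and_prepare
print(len(files), "files written")

The Parquet files can then be read back with any fsspec-aware tool (for example pandas or Dask) using the same storage_options.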

Ref:
HF Cloud storage


Hi! I'm trying to run this code from Datalore (a Jupyter notebook service like Colab), and I'm having issues with the AWS profile:

/usr/bin/sh: 1: aws: not found

Do you know how to fix it? Maybe I should set up my AWS profile or bucket somehow?

@chudotony Did you install awscli in your environment?
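In a fresh notebook environment the CLI usually has to be installed first, e.g.:

pip install awscli
aws configure --profile "your_profile_name"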


chudotony

I tried it after your comment, but it didn’t help

What's the best practice for GCS? I tried downloading allenai/c4/en with builder.download_and_prepare(), but it is very slow and gets stuck. I've tried several times.
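Roughly what I'm running (just a sketch; the GCS project, token path, and bucket name are placeholders):

import gcsfs
from datasets import load_dataset_builder

# gcsfs credentials: a GCP project and a service-account key file (placeholders)
storage_options = {"project": "your-gcp-project", "token": "/path/to/service_account.json"}
fs = gcsfs.GCSFileSystem(**storage_options)

# same pattern as the S3 example above, but with a gs:// output directory
builder = load_dataset_builder("allenai/c4", "en")
builder.download_and_prepare("gs://your-bucket/c4-en/", storage_options=storage_options, file_format="parquet")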