Best practice for saving large datasets to a cloud storage

I wanted to combine multiple large datasets and do some processing on them. For example, one of them is the OSCAR dataset. I thought of loading each dataset on a VM (for example, EC2) and then saving each record to cloud storage like an S3 bucket. But for multiple large datasets (~200 GB of text), this procedure costs both money and time.
I wanted to know if there is a better way to do this.

So after some searching and testing, I found a solution that looks good, so I'm sharing it here:

How to download/transfer a HF dataset to an S3 bucket


from botocore.session import Session
import s3fs
from datasets import load_dataset_builder

First, enter your credentials (AWS Access Key ID and AWS Secret Access Key):

storage_options = {"key": "XXX", "secret": "XXX"}

Alternatively, instantiate a boto session and pass that instead of the raw keys:

s3_session = Session(profile="your_profile_name")

storage_options = {"session": s3_session}

Heads up!
To create the profile, run this command first:

aws configure --profile "your_profile_name"

After that, create the file system:

fs = s3fs.S3FileSystem(**storage_options)

and then download your dataset into the S3 bucket:

builder = load_dataset_builder("your dataset id")

output_dir = "s3://path/to/the/bucket/"

builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")

HF Cloud storage


Hi! I’m trying to run this code on Datalore (a Jupyter notebook service like Colab), and I have an issue with the AWS profile:

/usr/bin/sh: 1: aws: not found

Do you know how to fix it? Maybe I should set up my AWS profile or bucket differently?

@chudotony Did you install awscli in your environment?



I tried that after your comment, but it didn’t help.

What’s the best practice for GCS? I tried downloading allenai/c4/en with builder.download_and_prepare(); it is very slow and gets stuck. I’ve tried several times.
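The same `download_and_prepare` pattern should work for GCS via fsspec/gcsfs: pass a `gcs://` output path and GCS credentials in `storage_options` (for gcsfs, a `token` entry pointing at a service-account key file). An untested sketch, assuming gcsfs is installed; the bucket path and key-file path are hypothetical:

```python
from datasets import load_dataset_builder

# gcsfs accepts a "token" pointing at a service-account key file
# (hypothetical path -- replace with your own).
storage_options = {"token": "path/to/service-account-key.json"}

builder = load_dataset_builder("allenai/c4", "en")
builder.download_and_prepare(
    "gcs://your-bucket/c4-en/",  # hypothetical bucket path
    storage_options=storage_options,
    file_format="parquet",
)
```

Note that c4's `en` config is hundreds of GB of compressed text, so a long prepare step is expected; running this on a VM in the same GCP region as the bucket should help with throughput.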