Best practice for saving large datasets to a cloud storage

saied · October 11, 2022, 1:39pm

Hi,
I want to combine multiple large datasets and do some processing on them. For example, one of them is the OSCAR dataset. I thought of loading each dataset on a VM (for example, EC2) and then saving each record to cloud storage such as an S3 bucket. But for multiple large datasets (~200GB of text), this procedure will cost both money and time.
I wanted to know if there is a better way to do this.
Thanks
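
To make the procedure above concrete, here is a minimal sketch of one variation: instead of writing one object per record, records are grouped into larger JSONL shards so that far fewer files have to be uploaded. Everything here is an assumption for illustration: the helper names `shard_records` and `write_shards`, the shard size, and the toy records are hypothetical, and the shards are written to a local temporary directory as a stand-in for an S3 bucket. In practice the records might come from something like `load_dataset(..., streaming=True)` and each shard might be uploaded with a client such as boto3, but neither is shown or tested here.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List


def shard_records(records: Iterable[dict], max_bytes: int) -> Iterator[List[str]]:
    """Group records into lists of JSON lines of at most about max_bytes each."""
    shard: List[str] = []
    size = 0
    for record in records:
        line = json.dumps(record, ensure_ascii=False)
        line_size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        # Start a new shard when adding this line would exceed the limit.
        if shard and size + line_size > max_bytes:
            yield shard
            shard, size = [], 0
        shard.append(line)
        size += line_size
    if shard:
        yield shard


def write_shards(records: Iterable[dict], out_dir: Path,
                 max_bytes: int = 64 * 1024 * 1024) -> List[Path]:
    """Write records as numbered JSONL shard files and return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for i, shard in enumerate(shard_records(records, max_bytes)):
        path = out_dir / f"shard-{i:05d}.jsonl"
        path.write_text("\n".join(shard) + "\n", encoding="utf-8")
        paths.append(path)  # a real pipeline would upload the file here
    return paths


# Toy records standing in for a streamed dataset (hypothetical data).
records = ({"id": i, "text": "x" * 100} for i in range(1000))
out_dir = Path(tempfile.mkdtemp())
paths = write_shards(records, out_dir, max_bytes=10_000)
```

Because the input is consumed as an iterator, only one shard is held in memory at a time, so the same loop should work whether the source is a small list or a streamed dataset.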

Related topics

Topic	Category	Replies	Views	Activity
How to write a dataset load script using private S3 storage	🤗Datasets	2	1341	December 1, 2022
How can I convert a loaded dataset in to a parquet file and save it to the S3	🤗Datasets	2	4311	July 31, 2023
Host and share datasets: S3	🤗Datasets	1	1204	July 22, 2022
Help creating dataset from s3 bucket with parquet files	🤗Datasets	2	1081	July 27, 2023
How do you save an IterableDataset to disk?	🤗Datasets	3	738	November 18, 2024