Save map cache to S3 bucket

I’m currently using gcsfuse to load data from Arrow tables stored in a Google Cloud Storage bucket mounted to a VM folder. The tables were saved automatically when the map() function cached the datasets to the mounted folder, which is all well and good, but gcsfuse often struggles with large datasets (I/O errors, sluggishness, etc.).

Is there a way to use the equivalent of save_to_disk for the map() function’s caching? That way I could use the S3 filesystem to load the tables in a more stable fashion.
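For reference, this is roughly what the setup looks like (the bucket, mount point, data files, and processing function are all placeholders):

```python
# Rough sketch of the current setup, with placeholder names
# (the bucket is mounted with gcsfuse, e.g. `gcsfuse my-bucket /mnt/gcs`).
from datasets import load_dataset

# The dataset cache lives on the mounted bucket, so map() writes its
# Arrow cache files there as well.
dataset = load_dataset(
    "json",
    data_files="data.jsonl",
    cache_dir="/mnt/gcs/hf_cache",
)

def preprocess(batch):
    # Stand-in for the real processing step.
    return batch

# The cached Arrow output of map() ends up on the gcsfuse mount, which is
# where the I/O errors and slowdowns appear with large datasets.
processed = dataset["train"].map(preprocess, batched=True)
```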

Hi! When exactly do you experience slowdowns or errors?

Have you tried making map write to a local directory first (you can pass cache_file_name to map to specify where to write the Arrow data), and then saving it to your cloud storage using save_to_disk?

This way, accessing the data from your processed dataset will be much faster, since the dataset will be on local disk.
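For example, something along these lines (a toy dataset and placeholder paths, just to illustrate the idea: /local/cache for the VM’s local disk, /mnt/gcs/... for the mounted bucket):

```python
import os
from datasets import Dataset, load_from_disk

# Toy stand-ins for the real dataset and processing function.
dataset = Dataset.from_dict({"text": ["a", "b", "c"]})

def preprocess(batch):
    batch["length"] = [len(t) for t in batch["text"]]
    return batch

# 1) Make map() write its Arrow cache to local disk instead of the mount.
os.makedirs("/local/cache", exist_ok=True)
processed = dataset.map(
    preprocess,
    batched=True,
    cache_file_name="/local/cache/processed.arrow",
)

# 2) Copy the finished dataset to the mounted bucket in one go.
processed.save_to_disk("/mnt/gcs/processed_dataset")

# 3) Reload it from the mount (or a local copy) when needed.
reloaded = load_from_disk("/mnt/gcs/processed_dataset")
```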

Thanks for the reply! I should have mentioned that the VM has 100 GB of disk space in total, and the dataset in question is larger than that by a factor of five or so. There is the option of creating a new VM with more disk, but I’d like that to be a last resort if possible.

However, now that you mention it, there’s no reason I can’t just map smaller batches of the dataframe it originates from and save_to_disk each local table before deleting it, then use concatenate_datasets afterwards. It would be very messy though, seeing as each mapping process already uses 30 workers (along with concatenate_datasets).
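Something like the following sketch, for example (hypothetical names throughout: df for the source dataframe, preprocess for the mapping function, plus placeholder local and mount paths):

```python
import os
import shutil

import pandas as pd
from datasets import Dataset, concatenate_datasets, load_from_disk

# Toy stand-ins; in practice df is the large source dataframe and
# preprocess is the real mapping function.
df = pd.DataFrame({"text": ["a", "b", "c", "d"]})

def preprocess(batch):
    batch["length"] = [len(t) for t in batch["text"]]
    return batch

num_shards = 2  # enough shards that one processed shard fits on local disk

for i in range(num_shards):
    shard = Dataset.from_pandas(df.iloc[i::num_shards], preserve_index=False)

    # Process the shard with a local cache file
    # (in practice you'd also pass num_proc=30 here, as in the current setup).
    os.makedirs("/local/cache", exist_ok=True)
    processed = shard.map(
        preprocess,
        batched=True,
        cache_file_name=f"/local/cache/shard_{i}.arrow",
    )

    # Push the finished shard to the bucket, then reclaim local disk space.
    processed.save_to_disk(f"/mnt/gcs/processed/shard_{i}")
    del processed
    shutil.rmtree("/local/cache")

# Reassemble once every shard is in the bucket.
full = concatenate_datasets(
    [load_from_disk(f"/mnt/gcs/processed/shard_{i}") for i in range(num_shards)]
)
```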

The error I get when writing to the mounted folder is along the lines of IOError: [Errno 5] Input/output error, which is also common when writing large files to a mounted Google Drive folder in Colab. The slowdowns refer to timeouts when reading the Arrow files. The kernel itself keeps running, but I need to re-run some of the cells, which means continuously monitoring the process (and that takes a long time).

Thanks!