Save map cache to S3 bucket

I’m currently using gcsfuse to load data from Arrow tables stored in a Google Cloud Storage bucket mounted to a VM folder. The tables were saved automatically when the map() function cached the datasets to the mounted folder, which is all well and good, but gcsfuse often struggles with large datasets (IO errors, sluggishness, etc.).

Is there a way to use the equivalent of save_to_disk for the map() function’s caching? That way I could use the S3 filesystem to load the tables in a more stable fashion.

Hi! When exactly do you experience slowdowns or errors?

Have you tried making map() write to a local directory first (you can pass cache_file_name to map() to specify where to write the Arrow data), and then saving it to your cloud storage with save_to_disk?

That way, accessing the data from your processed dataset will be much faster, since your dataset will be on your local disk.

Thanks for the reply! I should have mentioned that the VM has 100 GB of disk space in total, and the dataset in question is larger than this by a factor of 5 or so. There is the option of making a new one, but I’d like that to be a last resort if possible.

However, now that you mention it, there’s no reason why I can’t just map() smaller batches of the data frame it originates from and save_to_disk each local table before deleting it, then use concatenate_datasets afterwards. It’s quite messy though, seeing as each mapping process already uses 30 workers (along with concatenate_datasets).

The error I get when writing to the mounted folder is along the lines of IOError: [Errno 5] Input/output error, which is also common when writing large files to a mounted Google Drive folder in Colab. The slowdowns refer to timeouts when reading the Arrow files. The kernel doesn’t stop, but I need to re-run some of the cells, which means having to continuously monitor the process (and that takes a long time).