Increase in disk space usage when using map() in an Accelerate environment

Hi!

Q1: no, this is not necessary. It used to be necessary when contributing a dataset to https://github.com/huggingface/datasets, though.

Q2: indeed, map creates Arrow files to store the output of your map function. I would suggest deleting the cache files you no longer need to save some space. For example, you can check the cache files used by the unprocessed dataset (before map) with dataset.cache_files, and delete those once you have your processed dataset. You can also save your processed dataset somewhere with dataset.save_to_disk.
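
Here is a minimal sketch of the cleanup step. The helper name delete_cache_files is hypothetical (it is not part of the datasets library); it only assumes that dataset.cache_files gives you a list of dicts with a "filename" key, and it uses nothing but the standard library to remove those files:

```python
import os


def delete_cache_files(cache_files):
    """Delete the files listed in a `dataset.cache_files`-style list.

    Hypothetical helper, not a `datasets` API. `cache_files` is assumed
    to be a list of dicts that each carry a "filename" key.
    Returns the number of bytes freed.
    """
    freed = 0
    for entry in cache_files:
        path = entry["filename"]
        # Skip entries that were already removed or never existed.
        if os.path.isfile(path):
            freed += os.path.getsize(path)
            os.remove(path)
    return freed
```

You would typically record old_files = raw_dataset.cache_files before calling map, save the result with processed.save_to_disk(...), and only then call delete_cache_files(old_files). Keep in mind that once those files are gone, the unprocessed dataset has to be regenerated or downloaded again the next time you load it.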