Hi !
Q1: no, this is not necessary. It used to be necessary when contributing a dataset to https://github.com/huggingface/datasets, though.
Q2: indeed, map creates Arrow files to store the output of your map function. I would suggest deleting the cache files you don’t need anymore to save some space. For example, you can check the cache files used by the unprocessed dataset (before map) with dataset.cache_files, and delete those once you have your processed dataset. You can also save your processed dataset somewhere with dataset.save_to_disk.
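
Here is a rough sketch of that cleanup, in case it helps. The helper names delete_cache_files and process_save_and_clean are made up for illustration and are not part of the datasets library. It assumes dataset is a single datasets.Dataset (not a DatasetDict), whose cache_files is a list of dicts with a "filename" key. Keep in mind that once the original cache files are deleted, the unprocessed dataset has to be prepared again if you ever need it.

```python
import os


def delete_cache_files(cache_files):
    """Delete the files listed in a dataset.cache_files-style list.

    Expects a list of dicts such as [{"filename": "/path/to/cache.arrow"}].
    Returns the paths that were actually removed.
    """
    removed = []
    for entry in cache_files:
        path = entry["filename"]
        if os.path.isfile(path):  # skip entries that are already gone
            os.remove(path)
            removed.append(path)
    return removed


def process_save_and_clean(dataset, fn, output_dir):
    """Hypothetical helper: map, save the result, then drop the old cache."""
    # Files backing the unprocessed dataset (before map)
    old_cache_files = dataset.cache_files
    processed = dataset.map(fn)
    # Write a standalone copy before deleting anything
    processed.save_to_disk(output_dir)
    removed = delete_cache_files(old_cache_files)
    return processed, removed
```

You would call it as process_save_and_clean(dataset, my_function, "path/to/processed"), where my_function and the output path are placeholders, and reload the result later with datasets.load_from_disk.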