I am working on the Common Voice dataset and am only interested in the samples that have a labeled accent, so I am filtering the dataset for that. I'm not sure, though, whether it is better to use the save_to_disk/load_from_disk methods or just depend on the cache, which loads it pretty quickly. I don't have a ton of disk space. If I do save the filtered dataset, can I just delete the full dataset sitting in the cache afterwards? Just not really sure what to do based on the docs!
The save_to_disk/load_from_disk methods are commonly used in production or when you have a huge dataset of many GBs. Common Voice is not huge, so I believe you should just rely on the dataset cache. It's simpler and fast.