Saving Datasets vs Dataset Cache

priyammaz · February 10, 2024, 1:08am

I am working on the Common Voice dataset and am only interested in the samples that have a labeled accent, so I am filtering the dataset for that. I’m not sure, though, whether it is better to use the save_to_disk/load_from_disk methods or to just depend on the cache, which loads it pretty quickly. I don’t have a ton of disk space. If I do save the filtered dataset, can I just delete the full dataset sitting in the cache afterwards? Just not really sure what to do based on the docs!

gugaio · February 10, 2024, 1:56am

save_to_disk/load_from_disk is commonly used in production, or when you have a huge dataset with many GBs. Common Voice is not huge, so I believe you should use the dataset cache. It is simpler and fast.
