In-memory dataset to disk for caching operations

Sanderbaduk · May 2, 2022, 7:37am

I create datasets from pandas dataframes a lot, as it’s simply an easier way to preserve columns that hold arrays and similar values. However, I notice that my .map operations aren’t getting cached.
How can I turn an in-memory dataset into a disk-based one for caching (or load a dataframe directly as a disk-based dataset)?

mariosasko · May 2, 2022, 12:16pm

Hi! In-memory datasets create a temporary cache bound to the Python session. To cache operations permanently, save the dataset to disk with .save_to_disk("path/to/save/dir"), reload it with datasets.load_from_disk("path/to/save/dir") to get a version backed by an Arrow file, and then run the operations on it again.
