Trying to figure out when a dataset is stored in memory

Hello,
I'm trying to understand when a Hugging Face dataset is stored in memory, when it is stored in a cache directory, and when it is just read on the fly from storage.

More specifically:

  1. When downloading a dataset from the Hub with load_dataset, is the data stored in a cache directory and then read into memory on the fly during training?
  2. How about the following scenario for images (roughly as in the sketch after this list):
    I have a CSV with file paths and labels.
    I'll use from_dict or from_pandas to build the dataset.
    Then I will cast_column the file-path column with Image().
    Does this load all the examples into memory?
  3. How about after casting with Image(), when I do:
    dataset.set_format('torch')
    Does this now load all the data into memory?
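
For reference, here is roughly the pipeline I have in mind (a minimal sketch; the CSV name and column names are just placeholders):

```python
import pandas as pd
from datasets import Dataset, Image

# Placeholder CSV with "file_path" and "label" columns
df = pd.read_csv("images.csv")

# Build a dataset from the dataframe
ds = Dataset.from_pandas(df)

# Interpret the file-path column as images
ds = ds.cast_column("file_path", Image())

# Return torch tensors when examples are accessed
ds.set_format("torch")
```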

Ultimately, I'm trying to understand the best way to handle very large datasets.

Thanks!

You can use the .cache_files attribute to check whether a dataset is cached on disk (empty list if not):

  1. load_dataset always writes the dataset to disk, and then the dataset is memory-mapped (chunks are read into memory when needed)
  2. from_dict and from_pandas load data into memory. We always cache some ops on disk (e.g., map, filter), but only to a temporary directory (deleted on session exit) if running these ops on in-memory datasets.
  3. No, set_format modifies the formatting logic when converting from Arrow to Python.
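
For example, something along these lines shows the difference (the dataset name is just an example, any Hub dataset behaves the same way):

```python
from datasets import load_dataset, Dataset

# load_dataset writes Arrow files to the cache and memory-maps them
ds = load_dataset("imdb", split="train")
print(ds.cache_files)  # non-empty: list of Arrow files backing the dataset

# from_dict / from_pandas build the table in memory instead
in_memory = Dataset.from_dict({"a": [1, 2, 3]})
print(in_memory.cache_files)  # [] -> the dataset lives in RAM

# set_format only changes how examples are converted on access;
# it does not copy the data into memory
ds.set_format("torch")
print(ds.cache_files)  # still non-empty
```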

Thanks for your reply, Mario!
So, what is the way to go with custom very large datasets that obviously can't be loaded into memory entirely, but whose examples should be loaded on the fly?

load_dataset (from data files or a dataset loading script) or Dataset.from_generator is the way to go when working with large datasets.
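
For instance, a minimal from_generator sketch (the file and column names are placeholders); the generated examples are written to Arrow files and memory-mapped, so the full dataset never has to fit in RAM:

```python
from datasets import Dataset

def gen():
    # Lazily yield one example at a time, e.g. from a large CSV
    with open("labels.csv") as f:
        next(f)  # skip the header row
        for line in f:
            path, label = line.rstrip("\n").split(",")
            yield {"image_path": path, "label": label}

# Examples are written to Arrow files on disk and memory-mapped
ds = Dataset.from_generator(gen)
```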
