Hello,
I’m trying to understand when a Hugging Face dataset is stored in memory, when it is stored in a cache directory, and when it is just read on the fly from storage.
More specifically:
- When downloading a dataset from the Hub with load_dataset, is the data stored in a cache directory and then read into memory on the fly during training?
- How about the following scenario for images: I have a CSV with file paths and labels, I use from_dict or from_pandas, and then I cast_column the file-path column with Image(). Does this load all the examples into memory? (There’s a sketch of what I mean after this list.)
- And after casting with Image(), when I do dataset.set_format('torch'), does this now load all the data into memory?
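Here’s a minimal sketch of what I mean (the CSV path and the "image_path"/"label" column names are just placeholders):

```python
import pandas as pd
from datasets import Dataset, Image

# Placeholder CSV with a file-path column and a label column
df = pd.read_csv("metadata.csv")

ds = Dataset.from_pandas(df)                # build a dataset from the dataframe
ds = ds.cast_column("image_path", Image())  # interpret the path column as images
ds.set_format("torch")                      # return torch tensors when indexing
```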
Ultimately, I’m trying to understand the best way to work with very large datasets.
Thanks!
You can use the .cache_files attribute to check whether a dataset is cached on disk (empty list if not):
- load_dataset always writes a dataset on disk, and then the dataset is memory-mapped (chunks are read into memory when needed).
- from_dict and from_pandas load data into memory. We always cache some ops on disk (e.g., map, filter), but only to a temporary directory (deleted on session exit) if running these ops on in-memory datasets.
- No, set_format only modifies the formatting logic used when converting from Arrow to Python.
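For example (the "imdb" dataset here is just an illustration):

```python
from datasets import Dataset, load_dataset

# Loaded from the Hub: written to the cache directory and memory-mapped from disk
ds_on_disk = load_dataset("imdb", split="train")
print(ds_on_disk.cache_files)  # non-empty list pointing to the cached Arrow files

# Built from in-memory data: nothing cached on disk
ds_in_memory = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})
print(ds_in_memory.cache_files)  # []
```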
Thanks for your reply, Mario!
So what is the way to go with custom very large datasets that obviously can’t be loaded into memory entirely, and whose examples should instead be loaded on the fly?
load_dataset (from data files or a dataset loading script) or Dataset.from_generator is the way to go when working with large datasets.
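A minimal Dataset.from_generator sketch (the generator body is a placeholder; yield your own examples built from your file paths):

```python
from datasets import Dataset

def gen():
    # Placeholder: read your own CSV / file paths here and yield one example at a time
    for i in range(1_000_000):
        yield {"image_path": f"images/{i}.jpg", "label": i % 10}

# Examples are written to Arrow files on disk as they are generated,
# and the resulting dataset is memory-mapped instead of held fully in RAM
ds = Dataset.from_generator(gen)
```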