Trying to figure out when a dataset is stored in memory

Hello,
I'm trying to understand when a Hugging Face dataset is stored in memory, when it is stored in a cache directory, and when it is just read on the fly from storage.

More specifically:

  1. When downloading a dataset from the Hub with load_dataset, is the data stored in a cache directory and then read into memory on the fly during training?
  2. How about the following scenario for images (roughly as in the sketch after this list):
    I have a CSV with file paths and labels.
    I'll use from_dict or from_pandas to build the dataset.
    Then I will cast_column the file-path column with Image().
    Does this load all the examples into memory?
  3. How about after casting with Image(), when I do:
    dataset.set_format('torch')
    Does this now load all the data into memory?
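
For reference, here is roughly the pipeline I have in mind (a minimal sketch; the CSV name and column names are just placeholders):

```python
import pandas as pd
from datasets import Dataset, Image

# Placeholder CSV with "file_path" and "label" columns
df = pd.read_csv("images.csv")

# Build a dataset from the dataframe
ds = Dataset.from_pandas(df)

# Interpret the file-path column as images
ds = ds.cast_column("file_path", Image())

# Return torch tensors when examples are accessed
ds.set_format("torch")
```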

Ultimately, I'm trying to understand the best way to handle very large datasets.

Thanks!

You can use the .cache_files attribute to check whether a dataset is cached on disk (empty list if not):

  1. load_dataset always writes the dataset to disk, and then the dataset is memory-mapped (chunks are read into memory when needed)
  2. from_dict and from_pandas load data into memory. We always cache some ops on disk (e.g., map, filter), but only to a temporary directory (deleted on session exit) if running these ops on in-memory datasets.
  3. No, set_format modifies the formatting logic when converting from Arrow to Python.
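
For example, something along these lines shows the difference (the dataset name is just an example, any Hub dataset behaves the same way):

```python
from datasets import load_dataset, Dataset

# load_dataset writes Arrow files to the cache and memory-maps them
ds = load_dataset("imdb", split="train")
print(ds.cache_files)  # non-empty: list of Arrow files backing the dataset

# from_dict / from_pandas build the table in memory instead
in_memory = Dataset.from_dict({"a": [1, 2, 3]})
print(in_memory.cache_files)  # [] -> the dataset lives in RAM

# set_format only changes how examples are converted on access;
# it does not copy the data into memory
ds.set_format("torch")
print(ds.cache_files)  # still non-empty
```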

Thanks for your reply, Mario!
So, what is the way to go with custom very large datasets that obviously can't be loaded into memory entirely, but whose examples should be loaded on the fly?

load_dataset (from data files or a dataset loading script) or Dataset.from_generator is the way to go when working with large datasets.
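
For instance, a minimal from_generator sketch (the file and column names are placeholders); the generated examples are written to Arrow files and memory-mapped, so the full dataset never has to fit in RAM:

```python
from datasets import Dataset

def gen():
    # Lazily yield one example at a time, e.g. from a large CSV
    with open("labels.csv") as f:
        next(f)  # skip the header row
        for line in f:
            path, label = line.rstrip("\n").split(",")
            yield {"image_path": path, "label": label}

# Examples are written to Arrow files on disk and memory-mapped
ds = Dataset.from_generator(gen)
```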
