I have a few large (private) datasets that I want to concatenate. I can’t seem to load them without OOMing. Here’s an example of what I’m doing:
```python
import datasets

all_datasets = []
for name in ['meow/private1', 'meow/private2', 'meow/private3', 'meow/private4']:
    ds = datasets.load_dataset(name)
    all_datasets.append(ds)
all_datasets = datasets.concatenate_datasets(all_datasets)  # This line is never reached
```
The datasets do not have a custom loading script. They're cached locally as a set of Arrow files. From what I can tell from the docs, the files should be memory-mapped into my process and demand-loaded from disk. Note that I'm not even touching the data, so I'm not sure what's being loaded into memory. I've also tried setting `datasets.config.IN_MEMORY_MAX_SIZE` to 1 GB, with no luck.
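For context, this is the behaviour I'm expecting. A minimal stdlib `mmap` sketch (not `datasets` itself; the file path is just an example) showing that a memory-mapped file is paged in on demand rather than read into RAM up front:

```python
import mmap
import os
import tempfile

# Create a 16 MiB file on disk to stand in for a cached Arrow file.
path = os.path.join(tempfile.gettempdir(), "mmap_demo.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (16 * 1024 * 1024))

# Memory-map it read-only: opening the map costs almost no RAM.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching one byte faults in only that page, not the whole file.
    first_byte = mm[0]
    mm.close()

os.remove(path)
print(first_byte)  # 0
```

My assumption is that `load_dataset` does essentially this under the hood for the cached Arrow files, which is why I'd expect near-zero memory use before the data is accessed.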
Is my understanding incorrect? Does `load_dataset` always attempt to read the entire dataset into memory? Is there any way to load these datasets as map-style datasets (i.e., not as `IterableDataset`s)?