I have a few large (private) datasets that I want to concatenate. I can’t seem to load them without OOMing. Here’s an example of what I’m doing:
```python
import datasets

all_datasets = []
for name in ['meow/private1', 'meow/private2', 'meow/private3', 'meow/private4']:
    ds = datasets.load_dataset(name)
    all_datasets.append(ds)
all_datasets = datasets.concatenate_datasets(all_datasets)  # This line is never reached
```
The datasets do not have a custom loading script. They're cached locally as a set of Arrow files. From what I can tell from the docs, the files should be memory-mapped into my process and demand-loaded from disk. Note that I'm not even touching the data, so I'm not sure what's being loaded into memory. I've also tried setting `datasets.config.IN_MEMORY_MAX_SIZE` to 1 GB, with no luck.
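For context, this is the behaviour I'm expecting. A minimal stdlib `mmap` sketch (not `datasets` itself; the file path is just an example) showing that a memory-mapped file is paged in on demand rather than read into RAM up front:

```python
import mmap
import os
import tempfile

# Create a 16 MiB file on disk to stand in for a cached Arrow file.
path = os.path.join(tempfile.gettempdir(), "mmap_demo.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (16 * 1024 * 1024))

# Memory-map it read-only: opening the map costs almost no RAM.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching one byte faults in only that page, not the whole file.
    first_byte = mm[0]
    mm.close()

os.remove(path)
print(first_byte)  # 0
```

My assumption is that `load_dataset` does essentially this under the hood for the cached Arrow files, which is why I'd expect near-zero memory use before the data is accessed.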
Is my understanding incorrect? Does `load_dataset` always attempt to read the entire dataset into memory? Is there any way to load these datasets as map-style datasets (i.e., not as `IterableDataset`s)?