Multiprocessing `map` taking up too much memory


I use `map` to preprocess my very large dataset (about 450 GB). Running `map` on 1 node / 1 GPU to do the tokenization works fine, but I got stuck when moving to multi-node training on 10 nodes / 8 GPUs (num_proc = 18). I checked my system status, and it seems my memory overflows, which didn't happen in the 1 node / 1 GPU case. How can I fix it? Thanks!

Here's my training system status on 1 node / 1 GPU; the overflow happens when moving to 8 GPUs.

Total memory: 950 GB

I'm wondering whether this memory footprint is normal. Why does `map` take so much memory (about 400 GB)? I'm happy to provide more details about my training setup.

Hi! What do you get when you access the `cache_files` attribute of your dataset? `map` with multiprocessing can be an issue for in-memory datasets, due to the data being copied to the subprocesses (more info).

Thank you for the reply! @mariosasko
I'm not sure about `cache_files`, but I guess the dataset is cached to disk? There are messages like "found cached files from…" before `map` runs. I use `map` like this:

        with training_args.main_process_first(desc="train dataset map pre-processing"):
            train_dataset = train_dataset.map(
                tokenize_function,  # my tokenization function
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="running tokenizer on train dataset",
            )
The main process starts taking up a lot of memory after creating the num_proc workers to execute `map`, even though I have already run `map` before, so there is nothing to do but load the cache files.
I found that in the multi-GPU setup each subprocess executes `map` too, and memory usage jumps to 100%.

I saved the dataset after `map`, loaded it directly (so `map` is no longer needed), and removed `map` from my script. Everything works fine that way, so I think there is some problem with the workers in `map`.

I have two questions:

  1. Do all workers share the data, or does each get a copy?
  2. How can I control the memory use of each worker?
  1. Do all workers share the data, or does each get a copy?

When you load a dataset, it actually memory-maps the data from your disk. All workers memory-map the same files, so this memory is shared.

  2. How can I control the memory use of each worker?

It's the sum of the physical memory used by your job and the physical memory occupied by pages from memory-mapped files. Memory mapping brings pages into physical memory when you access the dataset and evicts them after a while if they're not used. Pages are also evicted automatically if your job needs a lot of memory.