How does `datasets.Dataset.map` parallelize data?

As I read here, the dataset is split into num_proc parts and each part is processed separately:

When num_proc > 1, map splits the dataset into num_proc shards, each of which is mapped to one of the num_proc workers. So in your case, this means that some workers finished processing their shards earlier than others.
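
For example, here is a minimal sketch (using a hypothetical toy dataset) that tags each row with the PID of the worker process that handled it, which makes the per-shard workers visible:

import os
from datasets import Dataset

toy = Dataset.from_dict({'text': [f'sentence {i}' for i in range(8)]})

def tag_with_pid(batch):
    # Each worker reports its own process id, so the output shows
    # which shard was handled by which of the num_proc workers.
    return {'pid': [os.getpid()] * len(batch['text'])}

tagged = toy.map(tag_with_pid, batched=True, batch_size=2, num_proc=4)
print(set(tagged['pid']))  # typically up to 4 distinct PIDs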

Here is my code:

import torch
from transformers import AutoTokenizer, AutoModel
from datasets import Dataset


def _get_embeddings(texts):
    # Tokenize the batch, padding/truncating to a common length
    encoded_input = tokenizer(
        texts, padding=True, truncation=True, return_tensors='pt'
    )

    # Inference only, so no gradients are tracked
    with torch.no_grad():
        encoded_input = {
            k: v.to('cpu') for k, v in encoded_input.items()
        }
        model_output = model(**encoded_input)

    # Return the pooled sentence embeddings as plain Python lists
    return model_output.pooler_output.tolist()

checkpoint = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

texts = [
    { 'text': 'first sentence' },
    { 'text': 'second sentence' },
#  ...
    { 'text': 'billionth sentence' }
]

dataset = Dataset.from_list(texts)

predictions = dataset.map(
    lambda x: {
        'embeddings': _get_embeddings(x['text'])
    },
    batched=True,   # pass batches of rows to the function
    batch_size=32,
    num_proc=4      # split the dataset into 4 shards / worker processes
)

Am I right that the code above will consume 4x more memory?

I mean, how exactly should this code work? Will the model somehow be copied to 4 Python processes, so that it consumes four times more RAM and I get overhead time costs? Or do all 4 Python processes have access to a “shared” model?


Hi!

This is the behavior you can expect by default:

The model will be somehow copied to 4 Python processes and this will consume four times more RAM and I’ll get overhead time costs

You can call `share_memory` on a CPU model, or run `import multiprocess; multiprocess.set_start_method("spawn", force=True)` in the case of a GPU model, to share the model across processes before launching a multiprocess map.
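
For example, a rough sketch of the two variants, reusing `model`, `dataset`, and `_get_embeddings` from above (an illustration of the suggestion, not a guaranteed fix; actual memory behavior depends on the platform and start method):

# --- CPU model ---
# Move the parameters into shared memory so the forked worker
# processes reuse the same tensors instead of copying them.
model.share_memory()

predictions = dataset.map(
    lambda x: {'embeddings': _get_embeddings(x['text'])},
    batched=True,
    batch_size=32,
    num_proc=4,
)

# --- GPU model ---
# Switch the multiprocessing start method to "spawn" before calling
# map, since CUDA does not work in forked worker processes.
# (_get_embeddings would then also need to move its inputs to 'cuda'
# instead of 'cpu'.)
import multiprocess
multiprocess.set_start_method("spawn", force=True)

model = model.to("cuda")
predictions = dataset.map(
    lambda x: {'embeddings': _get_embeddings(x['text'])},
    batched=True,
    batch_size=32,
    num_proc=4,
)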

Was this issue solved? I’m facing the same problem even after using `multiprocess.set_start_method("spawn", force=True)`.