Dataset.map() with batching and multiprocessing

varadhbhatnagar · March 1, 2024, 6:57am

I get a different number of output rows from Dataset.map() depending on the value of num_proc. I am comparing

dataset.map(..., batched=True, num_proc=4)
vs
dataset.map(..., batched=True, num_proc=16)

Here is the output:

Map (num_proc=4): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 1019.49 examples/s]
Map (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 1684.70 examples/s]
Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 148
}) Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 143
})

Is it expected that num_rows differs between the two runs?

varadhbhatnagar · March 5, 2024, 6:31am

@muellerzr Any thoughts on this?
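
One possible explanation, offered as a sketch rather than a confirmed diagnosis, since the mapped function is not shown in the post: with num_proc=N, map() splits the dataset into N shards and batches each shard separately (batch_size defaults to 1000). If the batched function's output depends on where the batch boundaries fall, for example by concatenating tokens across a batch and dropping the leftover tail, the total row count will change with num_proc. The pure-Python simulation below does not call the datasets library. `group_texts`, `BLOCK_SIZE`, `contiguous_shards`, `simulated_map`, and the toy data are all hypothetical stand-ins, and the contiguous split is an assumption about how rows reach each worker.

```python
from itertools import chain

BLOCK_SIZE = 8  # hypothetical chunk length, not taken from the original post


def group_texts(batch):
    """Hypothetical batched function: concatenate every token list in the
    batch, cut the result into BLOCK_SIZE chunks, and drop the remainder."""
    tokens = list(chain.from_iterable(batch))
    usable = (len(tokens) // BLOCK_SIZE) * BLOCK_SIZE
    return [tokens[i:i + BLOCK_SIZE] for i in range(0, usable, BLOCK_SIZE)]


def contiguous_shards(rows, num_shards):
    """Split rows into num_shards contiguous slices (assumed to approximate
    the per-worker split that map(num_proc=N) performs)."""
    div, mod = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + div + (1 if i < mod else 0)
        shards.append(rows[start:end])
        start = end
    return shards


def simulated_map(rows, num_proc, batch_size=1000):
    """Apply group_texts batch by batch within each shard, then merge."""
    out = []
    for shard in contiguous_shards(rows, num_proc):
        for i in range(0, len(shard), batch_size):
            out.extend(group_texts(shard[i:i + batch_size]))
    return out


# 500 toy examples with 3 tokens each
rows = [[j] * 3 for j in range(500)]

rows_1 = len(simulated_map(rows, 1))
rows_4 = len(simulated_map(rows, 4))
rows_16 = len(simulated_map(rows, 16))
print(rows_1, rows_4, rows_16)
```

In this simulation every shard discards its own leftover tokens, so more shards mean more discarded tails and fewer output rows. If the real function behaves like this, differing num_rows would be expected, and getting identical results would require a function whose output for each example does not depend on the other examples in its batch.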
