Dataset map function takes forever to run!

I’m trying to pre-process my dataset for the Donut model, and even though the progress bar shows the mapping has completed, the cell has been running for about 100 mins -.-. I ran this with num_proc=2; I'm not sure whether setting it to all CPU cores would make much of a difference.

Any idea of how to fix this?

Hi! What does processor.tokenizer.is_fast return? If the returned value is True, it’s better not to use the num_proc parameter in map, so you can benefit from the tokenizer’s own parallelism. The “fast” tokenizers are written in Rust and process data in parallel by default, but this does not work well in multi-process Python code, so we disable the “fast” tokenizers’ parallelism when num_proc>1 to avoid deadlocks.

Also, setting the return_tensors parameter to np should make the transform faster as PyArrow natively supports NumPy 1-D arrays, which avoids the torch → np conversion step.


Thanks @mariosasko. Yes, the tokenizer is fast. The reason I ran this with num_proc>1 is that without it the code ate up all my RAM (32 GB) and the kernel kept dying.