I’m trying to pre-process my dataset for the Donut model, and despite completing the mapping it has been running for about 100 minutes. I ran this with `num_proc=2`; I'm not sure whether setting it to all CPU cores would make much of a difference. Any idea how to fix this?
Hi! What does `processor.tokenizer.is_fast` return? If the returned value is `True`, it’s better not to use the `num_proc` parameter in `map`, so that you benefit from the tokenizer’s own parallelism. The “fast” tokenizers are written in Rust and process data in parallel by default, but this does not play well with multi-process Python code, so we disable the “fast” tokenizers’ parallelism when `num_proc>1` to avoid deadlocks.
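A minimal sketch of that decision, assuming `processor` is your loaded `DonutProcessor` (a stand-in object is used below so the snippet runs on its own):

```python
# Decide whether to pass num_proc to Dataset.map based on the tokenizer type.
# Stand-in for a loaded DonutProcessor with a fast (Rust) tokenizer:
class _FakeTokenizer:
    is_fast = True

class _FakeProcessor:
    tokenizer = _FakeTokenizer()

processor = _FakeProcessor()

map_kwargs = {}
if not processor.tokenizer.is_fast:
    # Slow (pure-Python) tokenizers have no internal parallelism,
    # so multiprocessing in `map` actually helps there.
    map_kwargs["num_proc"] = 2
# With a fast tokenizer, leave num_proc unset so its built-in
# parallelism isn't disabled.

print(map_kwargs)  # empty here, since the stand-in tokenizer is fast
```

You would then call `dataset.map(preprocess, **map_kwargs)` with whatever transform you already have.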
Also, setting the `return_tensors` parameter to `"np"` should make the transform faster: PyArrow natively supports NumPy 1-D arrays, which avoids the `torch` → `np` conversion step.
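A sketch of what that transform could look like. The real call would be `processor(images, return_tensors="np")`; a stand-in processor is used here so the snippet is self-contained, and the column name `"image"` is an assumption about your dataset schema:

```python
import numpy as np

def make_preprocess(processor):
    def preprocess(examples):
        # return_tensors="np" yields NumPy arrays, which PyArrow stores
        # natively -- no torch -> np conversion when `datasets` writes
        # the processed table.
        out = processor(examples["image"], return_tensors="np")
        return {"pixel_values": out["pixel_values"]}
    return preprocess

class _FakeProcessor:
    """Stand-in mimicking a processor's return_tensors="np" output."""
    def __call__(self, images, return_tensors="np"):
        assert return_tensors == "np"
        return {"pixel_values": np.zeros((len(images), 3, 4, 4), dtype=np.float32)}

batch = {"image": [object(), object()]}
result = make_preprocess(_FakeProcessor())(batch)
print(type(result["pixel_values"]))  # numpy.ndarray, not torch.Tensor
```

With the real processor you would pass `make_preprocess(processor)` (or an equivalent function) to `dataset.map(..., batched=True)`.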
Thanks @mariosasko. Yes, the tokenizer is fast. The reason I ran this with `num_proc>1` is that without it the code ate up all my RAM (32 GB) and the kernel kept dying.