I’ve recently spent some time scratching my head over some tricky bugs in parallel dataset processing (with `.map()`).
In particular at the moment, I have a use case with multiple chained `.map()` calls where everything works fine (but very slowly) as long as `num_proc=None`… But when scaling up to multiple workers, the system freezes partway through.
Specifically:
- The progress indicators always seem to suggest one of the `map()`s reaches the end at the point where the script becomes unresponsive: the last things to print are full progress bars from the final busy workers.
- Multiple cores (but not all of them / not as many as I have workers) show as “busy”, yet nothing further happens even after leaving the system for minutes or hours.
One reason I’m finding this tough to debug is that there are so many things that could go wrong in `map()` multiprocessing. I’ve already seen, for example, that with multiple processes, uncaught errors can sometimes silently crash workers without managing to print the error… which is fine for things that can be debugged with `num_proc=None`, but tricky after that!
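For context, the overall shape of the pipeline is roughly this (a simplified, self-contained sketch with toy data; the function names are placeholders, not my real code):

```python
from datasets import Dataset

def stage_one(batch):
    # placeholder for my real first-stage transform
    batch["text"] = [t.lower() for t in batch["text"]]
    return batch

def stage_two(batch):
    # placeholder for my real second-stage transform
    batch["n_chars"] = [len(t) for t in batch["text"]]
    return batch

# toy stand-in for the real dataset
ds = Dataset.from_dict({"text": ["Hello world", "The cat sat"]})

# With num_proc=None both stages complete (slowly on the real data); with
# num_proc>1 the run hangs near the end of one stage, after printing full
# progress bars.
ds = ds.map(stage_one, batched=True, num_proc=2)
ds = ds.map(stage_two, batched=True, num_proc=2)
```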
Can anybody help narrow down the list? Which of these things do we know could, or couldn’t, cause these kinds of hangs in situations where a dataset is prepared through multiple `.map()` stages? (I’ve put a minimal sketch of each pattern after the list.)
- Are there particular thread-safety restrictions on `fn_kwargs` keyword args, assuming these are only read and not written by the mapper? E.g. is it OK to use kwargs with types like `dict`, `set`, maybe `numpy.ndarray`? (First sketch below.)
- Is it true that a `TokenizerFast` should be usable in `.map()` (passed as a kwarg and then called on each batch) without expecting any problems? I think the `TOKENIZERS_PARALLELISM=false` env var may be required for this… (Second sketch below.)
- How about a `Processor`? Are there processors that might have a problem being used in this way?
- In the actual result columns of the dataset returned by the map function, is it OK to return `ndarray`s, or must they be plain Python lists? (Third sketch below.)
  - A) How about if multiple returned features are views that map to shared underlying ndarray data? (E.g. the map fn preparing one big ndarray and then returning slices of it without `.copy()`.)
  - B) How about if multiple examples in one feature of a returned batch are slices/views?
- What about variables in the closure scope of the map function? E.g. if some variables are in scope at the point the mapper function is `def`ined, and the function tries to access those - again assuming it’s in a relatively sensible read-only way. (Fourth sketch below.)
  - For example, using a pre-defined `logger` from the standard `logging` library?
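To make these concrete, here are minimal self-contained sketches of each pattern (toy data and placeholder names, not my real code). First, the `fn_kwargs` pattern, with read-only kwargs of the three types I mentioned:

```python
import numpy as np
from datasets import Dataset

def add_flags(batch, lookup, stopwords, weights):
    # all three kwargs are only read, never written
    batch["flag"] = [lookup.get(t, 0) for t in batch["text"]]
    batch["is_stop"] = [t in stopwords for t in batch["text"]]
    batch["weight"] = [float(weights[0])] * len(batch["text"])
    return batch

ds = Dataset.from_dict({"text": ["hello", "the"]})
ds = ds.map(
    add_flags,
    batched=True,
    num_proc=2,
    fn_kwargs={
        "lookup": {"hello": 1},     # dict
        "stopwords": {"the", "a"},  # set
        "weights": np.ones(4),      # numpy.ndarray
    },
)
```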
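Second, the tokenizer-as-kwarg pattern (the `Processor` question is the same shape, just with a processor passed instead). I’m assuming here that the env var needs to be set before the worker processes are forked, but I’m not certain that’s sufficient:

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # set before workers are forked?

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # a TokenizerFast

def tokenize(batch, tokenizer):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

ds = Dataset.from_dict({"text": ["hello world", "the cat sat"]})
ds = ds.map(tokenize, batched=True, num_proc=2, fn_kwargs={"tokenizer": tokenizer})
```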
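Third, the ndarray-returning patterns: (A) multiple returned features that are views of one shared buffer, and (B) per-example entries within one feature that are slices of a shared array:

```python
import numpy as np
from datasets import Dataset

def featurize(batch):
    n = len(batch["text"])
    big = np.arange(n * 8, dtype=np.float32).reshape(n, 8)  # one big array built up front

    # (A) two returned features that are views into the same underlying buffer,
    # returned without .copy()
    batch["feat_a"] = big[:, :4]
    batch["feat_b"] = big[:, 4:]

    # (B) one feature whose per-example entries are slices/views of a shared array
    batch["rows"] = [big[i] for i in range(n)]
    return batch

ds = Dataset.from_dict({"text": ["hello", "world"]})
ds = ds.map(featurize, batched=True, num_proc=2)
```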
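And finally, the closure-scope pattern, where the mapper reads a module-level `logger` and another variable that were in scope when it was defined:

```python
import logging
from datasets import Dataset

logger = logging.getLogger(__name__)  # pre-defined at module scope
SCALE = 2.0                           # a plain read-only variable in scope at def time

def scale_lengths(batch):
    logger.info("processing a batch of %d examples", len(batch["text"]))
    batch["scaled_len"] = [len(t) * SCALE for t in batch["text"]]
    return batch

ds = Dataset.from_dict({"text": ["hello", "world"]})
ds = ds.map(scale_lengths, batched=True, num_proc=2)
```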
So at the moment I have no end to my list of things that could possibly-maybe-perhaps cause deadlock/multiprocessing issues when chaining together a couple of `.map()` calls.
Any help narrowing down which things are unlikely/impossible to cause these problems, and which are probable causes, would be much appreciated!
For anybody keen to take the plunge, the code I’m struggling with lately is actually public here: specifically the `data` subfolder and the `dataproc_num_workers` parameter… But it’s part of a long workflow and designed to run on Amazon SageMaker + AWS - so not exactly a minimal & locally-reproducible example.