sagemaker /04_distributed_training_model_parallelism
I have customized `run_glue.py` to accept my custom data, and I get the following error:
```
File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 309, in thread_execute_tracing
[1,mpirank:0,algo-1]:     self._exec_trace_on_device(req, device)
[1,mpirank:0,algo-1]:   File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 268, in _exec_trace_on_device
[1,mpirank:0,algo-1]:     outputs = step_fn(*args, **kwargs)
[1,mpirank:0,algo-1]:   File "/opt/conda/lib/python3.9/site-packages/transformers/trainer_pt_utils.py", line 1061, in smp_forward_backward
[1,mpirank:0,algo-1]:     outputs = model(**inputs)
[1,mpirank:0,algo-1]:   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
[1,mpirank:0,algo-1]:     return forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]:   File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 75, in trace_forward
[1,mpirank:0,algo-1]:     raise e
```
…these frames repeat many times, and the traceback ends with:
```
[1,mpirank:0,algo-1]: RecursionError: maximum recursion depth exceeded while calling a Python object
```
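For context, the `RecursionError` is raised when CPython's interpreter-wide recursion limit is exceeded. The limit can be inspected and raised from the training script; this is a generic sketch of that workaround, and whether a higher limit actually resolves the smdistributed tracing loop here is an assumption, not something I have confirmed:

```python
import sys

# CPython's default recursion limit is typically 1000 frames; a deeply
# nested module hierarchy can exceed it during tracing.
print(sys.getrecursionlimit())

# Raising the limit is a blunt but common workaround for this class of
# error (hypothetical fix here -- it may only mask an infinite trace loop).
sys.setrecursionlimit(10_000)
print(sys.getrecursionlimit())  # → 10000
```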
@philschmid, I am tagging you since you created this notebook.