I’ve posted the stack trace from SageMaker below, but essentially I get `ValueError: not enough values to unpack (expected 2, got 1)`. The same script trains fine locally on multiple GPUs using FairScale's simple sharded DDP.
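
For context, here is roughly how the job is launched — a minimal sketch reconstructed from the `mpirun` command and hyperparameters in the log below; the instance type, IAM role, and container versions are placeholders I've filled in, not values taken from the failing job:

```python
from sagemaker.huggingface import HuggingFace

# Model-parallel settings read off the mpirun command in the log below.
smp_parameters = {
    "ddp": True,
    "microbatches": 1,
    "optimize": "speed",
    "partitions": 4,
    "pipeline": "interleaved",
    "placement_strategy": "spread",
}

huggingface_estimator = HuggingFace(
    entry_point="ledFinalTrainer.py",
    instance_count=1,
    instance_type="ml.p3.16xlarge",      # assumption: a single node with 8 GPUs
    role="<sagemaker-execution-role>",   # placeholder
    transformers_version="4.12.3",       # assumption: versions are not shown in the log
    pytorch_version="1.9.1",             # assumption
    py_version="py38",
    hyperparameters={
        "model_name": "HHousen/distil-led-large-cnn-16384",
        "epochs": 1,
        "train_batch_size": 1,
        "eval_batch_size": 1,
        "gradient_accumulation_steps": 4,
        "warmup_steps": 25,
        "logging_steps": 100,
        "evaluation_strategy": "steps",
        "output_dir": "/opt/ml/model",
    },
    distribution={
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)

huggingface_estimator.fit({"train": "s3://decisions-data/train", "test": "s3://decisions-data/test"})
```

With 8 processes on one host and `partitions=4`, each model replica spans 4 GPUs and there are 2 data-parallel replicas.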
```
2022-03-02 15:14:03 Starting - Starting the training job...
2022-03-02 15:14:27 Starting - Launching requested ML instances
ProfilerReport-1646234043: InProgress
.........
smdistributed/modelparallel/backend/split.py:166] Non-splittable object of type <class 'NoneType'> passed to smp.step. If this object contains tensors that need to be split across microbatches, implement a 'smp_slice' method for this class. See SMP documentation for further information.
[1,mpirank:2,algo-1]<stdout>:[2022-03-02 15:22:36.545 algo-1:54 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,mpirank:0,algo-1]<stderr>:Using apex fp16 backend
[1,mpirank:0,algo-1]<stdout>:Using apex fp16 backend
[1,mpirank:6,algo-1]<stdout>:[2022-03-02 15:22:36.560 algo-1:58 INFO profiler_config_parser.py:102] User has disabled profiler.
[1,mpirank:6,algo-1]<stdout>:[2022-03-02 15:22:36.560 algo-1:58 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[1,mpirank:6,algo-1]<stdout>:[2022-03-02 15:22:36.561 algo-1:58 INFO hook.py:200] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[1,mpirank:6,algo-1]<stdout>:[2022-03-02 15:22:36.561 algo-1:58 INFO hook.py:255] Saving to /opt/ml/output/tensors
[1,mpirank:6,algo-1]<stdout>:[2022-03-02 15:22:36.561 algo-1:58 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[1,mpirank:0,algo-1]<stderr>:***** Running training *****
[1,mpirank:0,algo-1]<stdout>:***** Running training *****
[1,mpirank:0,algo-1]<stderr>: Num examples = 32
[1,mpirank:0,algo-1]<stderr>: Num Epochs = 1
[1,mpirank:0,algo-1]<stderr>: Instantaneous batch size per device = 1
[1,mpirank:0,algo-1]<stdout>: Num examples = 32
[1,mpirank:0,algo-1]<stdout>: Num Epochs = 1
[1,mpirank:0,algo-1]<stdout>: Instantaneous batch size per device = 1
[1,mpirank:0,algo-1]<stdout>: Total train batch size (w. parallel, distributed & accumulation) = 8
[1,mpirank:0,algo-1]<stderr>: Total train batch size (w. parallel, distributed & accumulation) = 8
[1,mpirank:0,algo-1]<stderr>: Gradient Accumulation steps = 4
[1,mpirank:0,algo-1]<stderr>: Total optimization steps = 4
[1,mpirank:0,algo-1]<stdout>: Gradient Accumulation steps = 4
[1,mpirank:0,algo-1]<stdout>: Total optimization steps = 4
[1,mpirank:0,algo-1]<stderr>:#015 0%| | 0/4 [00:00<?, ?it/s]
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:36.670 algo-1:52 INFO hook.py:200] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:36.671 algo-1:52 INFO hook.py:255] Saving to /opt/ml/output/tensors
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:36.671 algo-1:52 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:36.678: W smdistributed/modelparallel/backend/split.py:166] Non-splittable object of type <class 'NoneType'> passed to smp.step. If this object contains tensors that need to be split across microbatches, implement a 'smp_slice' method for this class. See SMP documentation for further information.
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:36.679: I smdistributed/modelparallel/torch/worker.py:280] Tracing on GPU. If the model parameters do not fit in a single GPU, you can set trace_device to `cpu`.
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:36.995 algo-1:52 INFO hook.py:591] name:led.shared.weight count_params:51470336
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:36.996 algo-1:52 INFO hook.py:591] name:led.encoder.embed_positions.weight count_params:16777216
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:37.014 algo-1:52 INFO hook.py:593] Total Trainable Params: 359020544
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:37.014 algo-1:52 INFO hook.py:424] Monitoring the collections: losses
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:37.064: C smdistributed/modelparallel/torch/worker.py:105] [0] Hit an exception for 0/0 on thread 0: not enough values to unpack (expected 2, got 1)
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:37.074: C smdistributed/modelparallel/torch/worker.py:110] [0] File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 469, in _thread_compute
[1,mpirank:0,algo-1]<stdout>: self.thread_execute_tracing(req)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 286, in thread_execute_tracing
[1,mpirank:0,algo-1]<stdout>: self._exec_trace_on_device(req, device)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 250, in _exec_trace_on_device
[1,mpirank:0,algo-1]<stdout>: outputs = step_fn(*args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 1014, in smp_forward_backward
[1,mpirank:0,algo-1]<stdout>: outputs = model(**inputs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stdout>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/model.py", line 436, in forward
[1,mpirank:0,algo-1]<stdout>: return self.module(*args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stdout>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/ddp_model.py", line 222, in forward
[1,mpirank:0,algo-1]<stdout>: return self.module(*args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stdout>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 2365, in forward
[1,mpirank:0,algo-1]<stdout>: outputs = self.led(
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stdout>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 2217, in forward
[1,mpirank:0,algo-1]<stdout>: encoder_outputs = self.encoder(
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stdout>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 1794, in forward
[1,mpirank:0,algo-1]<stdout>: embed_pos = self.embed_positions(input_shape)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stdout>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 124, in forward
[1,mpirank:0,algo-1]<stdout>: return super().forward(positions)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 124, in forward
[1,mpirank:0,algo-1]<stdout>: return super().forward(positions)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stdout>: raise e
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stdout>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 120, in forward
[1,mpirank:0,algo-1]<stdout>: bsz, seq_len = input_ids_shape[:2]
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:37.075: C smdistributed/modelparallel/torch/worker.py:111] [0] Parent exec stack []
[1,mpirank:0,algo-1]<stdout>:[2022-03-02 15:22:37.075: C smdistributed/modelparallel/torch/worker.py:112] [0] Req <TraceReq::mb:0, requester:0>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[1,mpirank:0,algo-1]<stderr>: return _run_code(code, main_globals, None,
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[1,mpirank:0,algo-1]<stderr>: exec(code, run_globals)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,mpirank:0,algo-1]<stderr>: main()
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/mpi4py/run.py", line 196, in main
[1,mpirank:0,algo-1]<stderr>: run_command_line(args)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,mpirank:0,algo-1]<stderr>: run_path(sys.argv[0], run_name='__main__')
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/runpy.py", line 265, in run_path
[1,mpirank:0,algo-1]<stderr>: return _run_module_code(code, init_globals, run_name,
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/runpy.py", line 97, in _run_module_code
[1,mpirank:0,algo-1]<stderr>: _run_code(code, mod_globals, init_globals,
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[1,mpirank:0,algo-1]<stderr>: exec(code, run_globals)
[1,mpirank:0,algo-1]<stderr>: File "ledFinalTrainer.py", line 253, in <module>
[1,mpirank:0,algo-1]<stderr>: trainer.train()
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1316, in train
[1,mpirank:0,algo-1]<stderr>: tr_loss_step = self.training_step(model, inputs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1842, in training_step
[1,mpirank:0,algo-1]<stderr>: loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps, scaler=scaler)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/step.py", line 193, in __call__
[1,mpirank:0,algo-1]<stderr>: state.exec_server.run_step_leader(mb_args, mb_kwargs, self.id)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/server.py", line 332, in run_step_leader
[1,mpirank:0,algo-1]<stderr>: self.execute_request(
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/server.py", line 108, in execute_request
[1,mpirank:0,algo-1]<stderr>: chosen_worker.execute(req)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 150, in execute
[1,mpirank:0,algo-1]<stderr>: self._resume_thread_common()
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 181, in _resume_thread_common
[1,mpirank:0,algo-1]<stderr>: self._check_queue_after_thread_return()
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 116, in _check_queue_after_thread_return
[1,mpirank:0,algo-1]<stderr>: self._check_exception()
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 113, in _check_exception
[1,mpirank:0,algo-1]<stderr>: raise self.exception
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 469, in _thread_compute
[1,mpirank:0,algo-1]<stderr>: self.thread_execute_tracing(req)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 286, in thread_execute_tracing
[1,mpirank:0,algo-1]<stderr>: self._exec_trace_on_device(req, device)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/worker.py", line 250, in _exec_trace_on_device
[1,mpirank:0,algo-1]<stderr>: outputs = step_fn(*args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 1014, in smp_forward_backward
[1,mpirank:0,algo-1]<stderr>: outputs = model(**inputs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stderr>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/model.py", line 436, in forward
[1,mpirank:0,algo-1]<stderr>: return self.module(*args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stderr>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/ddp_model.py", line 222, in forward
[1,mpirank:0,algo-1]<stderr>: return self.module(*args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stderr>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 2365, in forward
[1,mpirank:0,algo-1]<stderr>: outputs = self.led(
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stderr>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 2217, in forward
[1,mpirank:0,algo-1]<stderr>: encoder_outputs = self.encoder(
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stderr>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 1794, in forward
[1,mpirank:0,algo-1]<stderr>: embed_pos = self.embed_positions(input_shape)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1100, in _call_impl
[1,mpirank:0,algo-1]<stderr>: result = forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 124, in forward
[1,mpirank:0,algo-1]<stderr>: return super().forward(positions)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 124, in forward
[1,mpirank:0,algo-1]<stderr>: return super().forward(positions)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 68, in trace_forward
[1,mpirank:0,algo-1]<stderr>: raise e
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 51, in trace_forward
[1,mpirank:0,algo-1]<stderr>: output = original_forward(self, *args, **kwargs)
[1,mpirank:0,algo-1]<stderr>: File "/opt/conda/lib/python3.8/site-packages/transformers/models/led/modeling_led.py", line 120, in forward
[1,mpirank:0,algo-1]<stderr>: bsz, seq_len = input_ids_shape[:2]
[1,mpirank:0,algo-1]<stderr>:ValueError: not enough values to unpack (expected 2, got 1)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[41174,1],0]
Exit code: 1
--------------------------------------------------------------------------
2022-03-02 15:22:39,556 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-03-02 15:22:39,557 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage ":ValueError: not enough values to unpack (expected 2, got 1)
-------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. mpirun.real detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[41174,1],0] Exit code: 1"
Command "mpirun --host algo-1:8 -np 8 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/opt/conda/lib/python3.8/site-packages/gethostname.cpython-38-x86_64-linux-gnu.so -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_CHANNEL_TEST -x SM_CHANNEL_TRAIN -x SM_HP_EVALUATION_STRATEGY -x SM_HP_EVAL_BATCH_SIZE -x SM_HP_GRADIENT_ACCUMULATION_STEPS -x SM_HP_TRAIN_BATCH_SIZE -x SM_HP_MODEL_NAME -x SM_HP_WARMUP_STEPS -x SM_HP_OUTPUT_DIR -x SM_HP_EPOCHS -x SM_HP_LOGGING_STEPS -x SM_HP_MP_PARAMETERS -x PYTHONPATH /opt/conda/bin/python3.8 -m mpi4py ledFinalTrainer.py --epochs 1 --eval_batch_size 1 --evaluation_strategy steps --gradient_accumulation_steps 4 --logging_steps 100 --model_name HHousen/distil-led-large-cnn-16384 --mp_parameters ddp=True,microbatches=1,optimize=speed,partitions=4,pipeline=interleaved,placement_strategy=spread --output_dir /opt/ml/model --train_batch_size 1 --warmup_steps 25"
2022-03-02 15:22:39,557 sagemaker-training-toolkit ERROR Encountered exit_code 1
2022-03-02 15:22:53 Uploading - Uploading generated training model
2022-03-02 15:22:53 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-47-aba514c9e2f8> in <module>
----> 1 huggingface_estimator.fit({"train":"s3://decisions-data/train","test":"s3://decisions-data/test"})
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
690 self.jobs.append(self.latest_training_job)
691 if wait:
--> 692 self.latest_training_job.wait(logs=logs)
693
694 def _compilation_job_name(self):
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1653 # If logs are requested, call logs_for_jobs.
1654 if logs != "None":
-> 1655 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1656 else:
1657 self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3777
3778 if wait:
-> 3779 self._check_job_status(job_name, description, "TrainingJobStatus")
3780 if dot:
3781 print()
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3336 ),
3337 allowed_statuses=["Completed", "Stopped"],
-> 3338 actual_status=status,
3339 )
3340
UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2022-03-02-15-14-03-282: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ":ValueError: not enough values to unpack (expected 2, got 1)
-------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. mpirun.real detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[41174,1],0] Exit code: 1"
Command "mpirun --host algo-1:8 -np 8 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/opt/conda/lib/python3.8/site-packages/gethostname.cpython-38-x86_64-linux-gnu.so -```