Optimum-neuron example script fails on Trainium instance

Hi,
I am trying to get the example script at https://github.com/huggingface/optimum-neuron/blob/main/examples/language-modeling/run_clm.py to run on an AWS Trainium instance.

I use the following command to start the script:

torchrun --nproc_per_node=2 run_clm.py --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --output_dir test_clm

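The trainer reaches only step 2 of 870 before failing. The output first shows "Killed", then a Neuron compiler failure, then NOT_FOUND errors for the compiled .neff file. The full output is below.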

0%|▏                                                                                                                                                  | 1/870 [00:13<3:18:52, 13.73s/it]............................Killed
2023-11-01 12:46:19.000626: INFO ||NCC_WRAPPER||: Compilation failed for /tmp/neuroncc_compile_workdir/24718dea-4712-467f-810f-f751081d8479/model.MODULE_8614444119913647429+d41d8cd9.hlo.pb after 0 retries.
2023-11-01 12:46:19.000689: INFO ||NCC_WRAPPER||: Compilation failed after reaching max retries.
2023-11-01 12:46:21.245111: E tensorflow/libtpu/neuron/neuron_compiler.cc:216] NEURONPOC : Unable to delete temp file /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff
2023-11-01 12:46:21.256044: E tensorflow/libtpu/neuron/neuron_compiler.cc:371] NEURONPOC: Could not read NEFF from MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff - Status : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
2023-11-01 12:46:35.930907: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
2023-11-01 12:46:36.033262: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
  0%|▎                                                                                                                                                | 2/870 [09:35<81:00:15, 335.96s/it]2023-11-01 12:46:39.055732: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-11-01 12:46:39.057339: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-11-01 12:46:39.057349: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	tsl::CurrentStackTrace()
2023-11-01 12:46:39.057353: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-11-01 12:46:39.057358: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-11-01 12:46:39.057363: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.057366: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-11-01 12:46:39.057370: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.057373: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.057376: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.057379: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.057382: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-11-01 12:46:39.057385: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-11-01 12:46:39.057389: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: NOT_FOUND: From /job:localservice/replica:0/task:0:
2023-11-01 12:46:39.057392: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-11-01 12:46:39.057395: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
2023-11-01 12:46:39.057399: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[{{node XRTExecute}}]]
2023-11-01 12:46:39.057402: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[XRTExecute_G15]]
2023-11-01 12:46:39.057406: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
2023-11-01 12:46:39.057409: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[{{node XRTExecute}}]]
2023-11-01 12:46:39.057412: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-11-01 12:46:39.057415: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-11-01 12:46:39.057418: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-11-01 12:46:39.057421: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
.2023-11-01 12:46:39.439488: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-11-01 12:46:39.441494: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-11-01 12:46:39.441498: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	tsl::CurrentStackTrace()
2023-11-01 12:46:39.441506: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-11-01 12:46:39.441510: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-11-01 12:46:39.441517: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.441521: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-11-01 12:46:39.441524: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.441528: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.441531: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.441534: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-11-01 12:46:39.441537: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-11-01 12:46:39.441543: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-11-01 12:46:39.441547: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: NOT_FOUND: From /job:localservice/replica:0/task:0:
2023-11-01 12:46:39.441550: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-11-01 12:46:39.441554: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
2023-11-01 12:46:39.441557: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[{{node XRTExecute}}]]
2023-11-01 12:46:39.441563: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[XRTExecute_G20]]
2023-11-01 12:46:39.441567: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
2023-11-01 12:46:39.441573: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[{{node XRTExecute}}]]
2023-11-01 12:46:39.441577: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-11-01 12:46:39.441579: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-11-01 12:46:39.441583: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-11-01 12:46:39.441586: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
2023-11-01 12:46:39.441592: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
Traceback (most recent call last):
  File "/home/ubuntu/run_clm.py", line 616, in <module>
    main()
  File "/home/ubuntu/run_clm.py", line 564, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/utils/patching.py", line 180, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 439, in _inner_training_loop
    return super()._inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1816, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/distributed/parallel_loader.py", line 34, in __next__
    return self.next()
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/distributed/parallel_loader.py", line 46, in next
    xm.mark_step()
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/core/xla_model.py", line 988, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: NOT_FOUND: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
	 [[{{node XRTExecute}}]]
	 [[XRTExecute_G15]]
  (1) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
	 [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
Traceback (most recent call last):
  File "/home/ubuntu/run_clm.py", line 616, in <module>
    main()
  File "/home/ubuntu/run_clm.py", line 564, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/utils/patching.py", line 180, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 439, in _inner_training_loop
    return super()._inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1816, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/distributed/parallel_loader.py", line 34, in __next__
    return self.next()
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/distributed/parallel_loader.py", line 46, in next
    xm.mark_step()
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/core/xla_model.py", line 988, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: NOT_FOUND: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
	 [[{{node XRTExecute}}]]
	 [[XRTExecute_G20]]
  (1) NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
	 [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
  OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.19128_8614444119913647429_ip-10-0-0-135-c0e66ffe-14106-6091687d9f030.neff; No such file or directory
  0%|▎                                                                                                                                                | 2/870 [09:49<71:00:34, 294.51s/it]
.ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13945) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-01_12:46:59
  host      : ip-10-0-0-135.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 13946)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-01_12:46:59
  host      : ip-10-0-0-135.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13945)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I see the same error when I run my own custom training script (see my previous question: https://discuss.huggingface.co/t/finetuning-gpt2-model-on-aws-trainium/59910).

I find this error tricky to debug because the AWS environment I am using comes pre-configured with the Hugging Face AMI. Any ideas what might be going wrong?
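
Because the output is long and repetitive, here is a minimal sketch of how the earliest failure markers can be pulled out of it. The function names, the marker list, and the SAMPLE excerpt are my own choices for illustration and are not part of optimum-neuron or the Neuron SDK. In the output I posted, "Killed" and "Compilation failed" both appear before the first NOT_FOUND error, so the missing .neff file looks like a consequence of the failed compilation rather than the root cause. A bare "Killed" usually means the process received SIGKILL, which on Linux is often the out-of-memory killer, but I have not confirmed that here.

```python
# Markers taken from the output in this post; illustrative, not exhaustive.
FAILURE_MARKERS = ("Killed", "Compilation failed", "NOT_FOUND")

# Three lines (or line fragments) copied from the output in this post.
SAMPLE = "\n".join([
    "1/870 [00:13<3:18:52, 13.73s/it]............................Killed",
    "2023-11-01 12:46:19.000689: INFO ||NCC_WRAPPER||: Compilation failed after reaching max retries.",
    "RuntimeError: NOT_FOUND: From /job:localservice/replica:0/task:0:",
])


def first_failures(log_text, markers=FAILURE_MARKERS):
    """Map each marker to (line_number, line) for the first line that contains it."""
    found = {}
    for number, line in enumerate(log_text.splitlines(), start=1):
        for marker in markers:
            if marker in line and marker not in found:
                found[marker] = (number, line.strip())
    return found


def in_order_of_appearance(found):
    """List the markers sorted by the line on which each first appears."""
    return sorted(found, key=lambda marker: found[marker][0])


if __name__ == "__main__":
    hits = first_failures(SAMPLE)
    for marker in in_order_of_appearance(hits):
        number, line = hits[marker]
        print(f"line {number}: [{marker}] {line}")
```

To use it on a real run, the torchrun output would first have to be saved to a file (for example by redirecting stdout and stderr) and its contents passed to first_failures in place of SAMPLE.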