Package errors running Hugging Face estimator on SageMaker

I’ve been running the code from this example on a SageMaker notebook with no changes.

When I run with the git_config variables, the model works fine. When I download the same training script and requirements file from the 4.6 branch and update source_dir and entry_point, it usually leads to a missing package error.

How can I resolve this? I need to be able to run from a local training script as the entry point.

Estimator code

from sagemaker.huggingface import HuggingFace

# metric_definitions, instance_type, instance_count, volume_size, role,
# distribution and hyperparameters are defined elsewhere in the notebook (not shown).
huggingface_estimator = HuggingFace(
    entry_point='run_qa_46.py',  # local copy of the training script I downloaded
    source_dir='./scripts',
    # source_dir='./examples/pytorch/question-answering',
    # git_config=git_config,
    metric_definitions=metric_definitions,
    instance_type=instance_type,
    instance_count=instance_count,
    volume_size=volume_size,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    distribution=distribution,
    hyperparameters=hyperparameters,
)

Traceback

2023-02-09 19:40:13 Starting - Starting the training job.........
2023-02-09 19:41:46 Starting - Preparing the instances for training.........
2023-02-09 19:43:27 Downloading - Downloading input data
2023-02-09 19:43:27 Training - Downloading the training image............
2023-02-09 19:45:28 Training - Training image download completed. Training in progress......bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-02-09 19:46:20,139 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-02-09 19:46:20,215 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-02-09 19:46:20,218 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2023-02-09 19:46:20,218 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-02-09 19:46:20,450 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.6 -m pip install -r requirements.txt
Requirement already satisfied: datasets>=1.4.0 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (1.6.2)
Requirement already satisfied: torch>=1.3.0 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (1.7.1)
Requirement already satisfied: huggingface-hub<0.1.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.4.0->-r requirements.txt (line 1)) (0.0.8)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas->datasets>=1.4.0->-r requirements.txt (line 1)) (1.16.0)
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
2023-02-09 19:46:22,633 sagemaker-training-toolkit INFO     Starting MPI run as worker node.
2023-02-09 19:46:22,633 sagemaker-training-toolkit INFO     Creating SSH daemon.
2023-02-09 19:46:22,635 sagemaker-training-toolkit INFO     Waiting for MPI workers to establish their SSH connections
2023-02-09 19:46:22,636 sagemaker-training-toolkit INFO     Network interface name: eth0
2023-02-09 19:46:22,636 sagemaker-training-toolkit INFO     Host: ['algo-1']
2023-02-09 19:46:22,637 sagemaker-training-toolkit INFO     instance type: ml.p3.16xlarge
2023-02-09 19:46:22,714 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {
        "sagemaker_distributed_dataparallel_custom_mpi_options": "",
        "sagemaker_distributed_dataparallel_enabled": true,
        "sagemaker_instance_type": "ml.p3.16xlarge"
    },
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "dataset_name": "squad",
        "do_eval": true,
        "do_train": true,
        "doc_stride": 128,
        "fp16": true,
        "max_seq_length": 384,
        "max_steps": 100,
        "model_name_or_path": "bert-large-uncased-whole-word-masking",
        "num_train_epochs": 2,
        "output_dir": "/opt/ml/model",
        "pad_to_max_length": true,
        "per_device_eval_batch_size": 4,
        "per_device_train_batch_size": 4
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-training-2023-02-09-19-40-09-335",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-479532355287/huggingface-pytorch-training-2023-02-09-19-40-09-335/source/sourcedir.tar.gz",
    "module_name": "run_qa_46",
    "network_interface_name": "eth0",
    "num_cpus": 64,
    "num_gpus": 8,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.p3.16xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.p3.16xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "run_qa_46.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_name":"squad","do_eval":true,"do_train":true,"doc_stride":128,"fp16":true,"max_seq_length":384,"max_steps":100,"model_name_or_path":"bert-large-uncased-whole-word-masking","num_train_epochs":2,"output_dir":"/opt/ml/model","pad_to_max_length":true,"per_device_eval_batch_size":4,"per_device_train_batch_size":4}
SM_USER_ENTRY_POINT=run_qa_46.py
SM_FRAMEWORK_PARAMS={"sagemaker_distributed_dataparallel_custom_mpi_options":"","sagemaker_distributed_dataparallel_enabled":true,"sagemaker_instance_type":"ml.p3.16xlarge"}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.p3.16xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.p3.16xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=run_qa_46
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=64
SM_NUM_GPUS=8
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-479532355287/huggingface-pytorch-training-2023-02-09-19-40-09-335/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{"sagemaker_distributed_dataparallel_custom_mpi_options":"","sagemaker_distributed_dataparallel_enabled":true,"sagemaker_instance_type":"ml.p3.16xlarge"},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_name":"squad","do_eval":true,"do_train":true,"doc_stride":128,"fp16":true,"max_seq_length":384,"max_steps":100,"model_name_or_path":"bert-large-uncased-whole-word-masking","num_train_epochs":2,"output_dir":"/opt/ml/model","pad_to_max_length":true,"per_device_eval_batch_size":4,"per_device_train_batch_size":4},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2023-02-09-19-40-09-335","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-479532355287/huggingface-pytorch-training-2023-02-09-19-40-09-335/source/sourcedir.tar.gz","module_name":"run_qa_46","network_interface_name":"eth0","num_cpus":64,"num_gpus":8,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so smddprun /opt/conda/bin/python3.6 -m mpi4py run_qa_46.py --dataset_name squad --do_eval True --do_train True --doc_stride 128 --fp16 True --max_seq_length 384 --max_steps 100 --model_name_or_path bert-large-uncased-whole-word-masking --num_train_epochs 2 --output_dir /opt/ml/model --pad_to_max_length True --per_device_eval_batch_size 4 --per_device_train_batch_size 4
[1,3]<stderr>:Traceback (most recent call last):
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,3]<stderr>:    "__main__", mod_spec)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,3]<stderr>:    exec(code, run_globals)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,3]<stderr>:    main()
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,3]<stderr>:    run_command_line(args)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,3]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,3]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,3]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,3]<stderr>:    exec(code, run_globals)
[1,3]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,3]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,3]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,5]<stderr>:Traceback (most recent call last):
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,5]<stderr>:    "__main__", mod_spec)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,5]<stderr>:    exec(code, run_globals)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,5]<stderr>:    main()
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,5]<stderr>:    run_command_line(args)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,5]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,5]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,5]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,5]<stderr>:    exec(code, run_globals)
[1,5]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,5]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,5]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,6]<stderr>:    "__main__", mod_spec)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,6]<stderr>:    exec(code, run_globals)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,6]<stderr>:    main()
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,6]<stderr>:    run_command_line(args)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,6]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,6]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,6]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
2023-02-09 19:46:24,583 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,6]<stderr>:    exec(code, run_globals)
[1,6]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,6]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,6]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
Command "mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so smddprun /opt/conda/bin/python3.6 -m mpi4py run_qa_46.py --dataset_name squad --do_eval True --do_train True --doc_stride 128 --fp16 True --max_seq_length 384 --max_steps 100 --model_name_or_path bert-large-uncased-whole-word-masking --num_train_epochs 2 --output_dir /opt/ml/model --pad_to_max_length True --per_device_eval_batch_size 4 --per_device_train_batch_size 4"
[1,4]<stderr>:Traceback (most recent call last):
[1,3]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,3]<stderr>:    exec(code, run_globals)
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,4]<stderr>:    "__main__", mod_spec)
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,4]<stderr>:    exec(code, run_globals)
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,4]<stderr>:    main()
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,4]<stderr>:    run_command_line(args)
[1,3]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,3]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,3]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,5]<stderr>:Traceback (most recent call last):
[1,4]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,4]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,4]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,4]<stderr>:    exec(code, run_globals)
[1,4]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,4]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,4]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,2]<stderr>:Traceback (most recent call last):
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,2]<stderr>:    "__main__", mod_spec)
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,2]<stderr>:    exec(code, run_globals)
[1,5]<stderr>:    "__main__", mod_spec)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,5]<stderr>:    exec(code, run_globals)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,2]<stderr>:    main()
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,2]<stderr>:    run_command_line(args)
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,2]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,5]<stderr>:    main()
[1,2]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,5]<stderr>:    run_command_line(args)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,5]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,5]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,5]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,5]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,5]<stderr>:    exec(code, run_globals)
[1,5]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,5]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,2]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,2]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,2]<stderr>:    exec(code, run_globals)
[1,2]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,2]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,2]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,0]<stderr>:    "__main__", mod_spec)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,5]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,0]<stderr>:    main()
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,6]<stderr>:    "__main__", mod_spec)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,6]<stderr>:    exec(code, run_globals)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,6]<stderr>:    main()
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,6]<stderr>:    run_command_line(args)
[1,0]<stderr>:    run_command_line(args)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,0]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,0]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,0]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,6]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,6]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,6]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,6]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,6]<stderr>:    exec(code, run_globals)
[1,6]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,6]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,6]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,0]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,7]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,0]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,1]<stderr>:Traceback (most recent call last):
[1,7]<stderr>:    "__main__", mod_spec)
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stderr>:    exec(code, run_globals)
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,1]<stderr>:    "__main__", mod_spec)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,1]<stderr>:    exec(code, run_globals)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,1]<stderr>:    main()
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,1]<stderr>:    run_command_line(args)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,1]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,7]<stderr>:    main()
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,7]<stderr>:    run_command_line(args)
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,1]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,7]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,7]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,7]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,7]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stderr>:    exec(code, run_globals)
[1,7]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,7]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,7]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,4]<stderr>:Traceback (most recent call last):
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,4]<stderr>:    "__main__", mod_spec)
[1,4]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,4]<stderr>:    exec(code, run_globals)
[1,1]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,1]<stderr>:    exec(code, run_globals)
[1,1]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,1]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,1]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,0]<stderr>:    "__main__", mod_spec)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,0]<stderr>:    main()
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,0]<stderr>:    run_command_line(args)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,0]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,0]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,0]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,0]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,0]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,1]<stderr>:    "__main__", mod_spec)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,1]<stderr>:    exec(code, run_globals)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,1]<stderr>:    main()
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,1]<stderr>:    run_command_line(args)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,1]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,1]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,1]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,1]<stderr>:    exec(code, run_globals)
[1,1]<stderr>:  File "run_qa_46.py", line 30, in <module>
[1,1]<stderr>:    from trainer_qa import QuestionAnsweringTrainer
[1,1]<stderr>:ModuleNotFoundError: No module named 'trainer_qa'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[41157,1],2]
  Exit code:    1
--------------------------------------------------------------------------

2023-02-09 19:46:41 Uploading - Uploading generated training model
2023-02-09 19:46:41 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
Input In [41], in <cell line: 2>()
      1 # starting the train job
----> 2 huggingface_estimator.fit()

File ~\AppData\Roaming\Python\Python39\site-packages\sagemaker\workflow\pipeline_context.py:272, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
    268         return context
    270     return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 272 return run_func(*args, **kwargs)

File ~\AppData\Roaming\Python\Python39\site-packages\sagemaker\estimator.py:1163, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
   1161 self.jobs.append(self.latest_training_job)
   1162 if wait:
-> 1163     self.latest_training_job.wait(logs=logs)

File ~\AppData\Roaming\Python\Python39\site-packages\sagemaker\estimator.py:2311, in _TrainingJob.wait(self, logs)
   2309 # If logs are requested, call logs_for_jobs.
   2310 if logs != "None":
-> 2311     self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2312 else:
   2313     self.sagemaker_session.wait_for_job(self.job_name)

File ~\AppData\Roaming\Python\Python39\site-packages\sagemaker\session.py:4176, in Session.logs_for_job(self, job_name, wait, poll, log_type)
   4173             last_profiler_rule_statuses = profiler_rule_statuses
   4175 if wait:
-> 4176     self._check_job_status(job_name, description, "TrainingJobStatus")
   4177     if dot:
   4178         print()

File ~\AppData\Roaming\Python\Python39\site-packages\sagemaker\session.py:3707, in Session._check_job_status(self, job, desc, status_key_name)
   3701 if "CapacityError" in str(reason):
   3702     raise exceptions.CapacityError(
   3703         message=message,
   3704         allowed_statuses=["Completed", "Stopped"],
   3705         actual_status=status,
   3706     )
-> 3707 raise exceptions.UnexpectedStatusException(
   3708     message=message,
   3709     allowed_statuses=["Completed", "Stopped"],
   3710     actual_status=status,
   3711 )

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2023-02-09-19-40-09-335: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so smddprun /opt/conda/bin/python3.6 -m mpi4py run_qa_46.py --dataset_name squad --do_eval True --do_train True --doc_stride 128 --fp16 True --max_seq_length 384 --max_steps 100 --model_name_or_path bert-large-uncased-whole-word-masking --num_train_epochs 2 --output_dir /opt/ml/model --pad_to_max_length True --per_device_eval_batch_size 4 --per_device_train_batch_size 4"
[1,3]<stderr>:Traceback (most recent call last):

For anyone reading this: the issue was that the training script imports two other files (trainer_qa is the one shown in the traceback) that I also needed to download from the GitHub repo and place alongside it. It was not a package failing to install from a package repo.
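
If it helps anyone else, here is a rough sketch of a check you could run in the notebook before calling fit(), to catch this kind of problem without waiting for a training job to fail. It parses the entry point and lists the top-level imports that are neither files in source_dir nor importable in the notebook's own Python environment. The function name missing_local_modules is made up for this post, and the check is only a heuristic: a package that is installed in the training container but not in your notebook will also be reported, so treat the output as a list of things to look at rather than a definitive answer.

```python
import ast
import importlib.util
from pathlib import Path


def missing_local_modules(source_dir, entry_point):
    """Hypothetical helper: list top-level modules imported by the entry point
    that are neither present in source_dir nor importable locally."""
    src = Path(source_dir)
    tree = ast.parse((src / entry_point).read_text(encoding="utf-8"))

    # Collect the top-level name of every absolute import in the script.
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])

    missing = []
    for name in sorted(names):
        # A sibling file or package in source_dir is uploaded with the job.
        if (src / f"{name}.py").exists() or (src / name).is_dir():
            continue
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Paths taken from the estimator above; adjust to your own layout.
    script = Path("./scripts") / "run_qa_46.py"
    if script.exists():
        print(missing_local_modules("./scripts", "run_qa_46.py"))
```

In my case I would expect this to have flagged trainer_qa, since that file was not in ./scripts when the job ran.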