Hey! Sorry to post again. This error is really over my head. The script works fine on a single GPU, but after adding the argument for distributed training it fails with some really unusual errors. I essentially just added the distribution argument and changed the instance type and number of instances. Am I missing something here? This is being run with run_summarization.py. I also deleted the parts of the log that weren't relevant because of the character limit for posts. Do I potentially have to use the wrapper script you have in the git examples folder that is meant for distributed training? Thanks a lot!
from sagemaker.huggingface import HuggingFace

# hyperparameters passed to the fine-tuning script
hyperparameters = {
    'model_name_or_path': 'google/pegasus-large',
    'train_file': '/opt/ml/input/data/train/final_aws_deepgram_train.csv',
    'test_file': '/opt/ml/input/data/test/final_aws_deepgram_test.csv',
    'validation_file': '/opt/ml/input/data/validation/final_aws_deepgram_validation.csv',
    'text_column': 'document',
    'summary_column': 'summary',
    'do_train': True,
    'do_eval': True,
    'fp16': True,
    'per_device_train_batch_size': 2,
    'per_device_eval_batch_size': 2,
    'evaluation_strategy': 'steps',
    'eval_steps': 1000,
    'weight_decay': 0.01,
    'learning_rate': 2e-5,
    'max_grad_norm': 1,
    'max_steps': 2000,
    'max_source_length': 500,
    'max_target_length': 100,
    'load_best_model_at_end': True,
    'output_dir': '/opt/ml/model',
}

# configuration for running training with SageMaker Distributed Data Parallel
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# instance configuration
instance_type = 'ml.p3.16xlarge'
instance_count = 2
volume_size = 200

# estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization_original.py',
    source_dir='transformers/examples/pytorch/summarization',
    git_config=git_config,
    instance_type=instance_type,
    instance_count=instance_count,
    volume_size=volume_size,
    role=role,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    distribution=distribution,
    hyperparameters=hyperparameters,
)
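For reference, the job is launched with a fit() call like the sketch below (reconstructed from the traceback at the end of this post; the S3 URIs are the ones shown there, and the channel names line up with the /opt/ml/input/data paths in the hyperparameters above):

# launch the training job with one channel per CSV file
huggingface_estimator.fit(
    {
        'train': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_train.csv',
        'test': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_test.csv',
        'validation': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_validation.csv',
    }
)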
2021-06-22 21:50:12,132 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "mpirun --host algo-1:8,algo-2:8 -np 16 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.p3.16xlarge smddprun /opt/conda/bin/python3.6 -m mpi4py run_summarization_original.py --do_eval True --do_train True --eval_steps 1000 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 2000 --max_target_length 100 --model_name_or_path google/pegasus-large --output_dir /opt/ml/model --per_device_eval_batch_size 2 --per_device_train_batch_size 2 --summary_column summary --test_file /opt/ml/input/data/test/final_aws_deepgram_test.csv --text_column document --train_file /opt/ml/input/data/train/final_aws_deepgram_train.csv --validation_file /opt/ml/input/data/validation/final_aws_deepgram_validation.csv --weight_decay 0.01"
Warning: Permanently added 'algo-2,10.2.251.196' (ECDSA) to the list of known hosts.#015
[... dataset table-loading and tokenizer download progress bars from multiple ranks trimmed ...]
[1,0]<stderr>:https://huggingface.co/google/pegasus-large/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp2ojn0fqy
[1,8]<stderr>:loading configuration file https://huggingface.co/google/pegasus-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:Downloading: 100%|██████████| 3.09k/3.09k [00:00<00:00, 2.57MB/s]
[1,0]<stderr>:storing https://huggingface.co/google/pegasus-large/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:creating metadata file for /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:loading configuration file https://huggingface.co/google/pegasus-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:Model config PegasusConfig {
[1,0]<stderr>: "_name_or_path": "google/pegasus-large",
[1,0]<stderr>: "activation_dropout": 0.1,
[1,0]<stderr>: "activation_function": "relu",
[1,0]<stderr>: "add_bias_logits": false,
[1,0]<stderr>: "add_final_layer_norm": true,
[1,0]<stderr>: "architectures": [
[1,0]<stderr>: "PegasusForConditionalGeneration"
[1,0]<stderr>: ],
[1,8]<stderr>:loading weights file https://huggingface.co/google/pegasus-large/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/ef3a8274e003ba4d3ae63f2728378e73affec0029e797c0bbb80be8856130c4f.a99cb24bd92c7087e95d96a1c3eb660b51e498705f8bd068a58c69c20616f514
[1,0]<stderr>:loading weights file https://huggingface.co/google/pegasus-large/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/ef3a8274e003ba4d3ae63f2728378e73affec0029e797c0bbb80be8856130c4f.a99cb24bd92c7087e95d96a1c3eb660b51e498705f8bd068a58c69c20616f514
[1,8]<stderr>:All model checkpoint weights were used when initializing PegasusForConditionalGeneration.
[1,8]<stderr>: "max_position_embeddings": 1024
[1,8]<stderr>: },
[1,8]<stderr>: "summarization_reddit_tifu": {
[1,8]<stderr>: "length_penalty": 0.6,
[1,8]<stderr>: "max_length": 128,
[1,8]<stderr>: "max_position_embeddings": 512
[1,8]<stderr>: },
[1,8]<stderr>: "summarization_wikihow": {
[1,8]<stderr>: "length_penalty": 0.6,
[1,8]<stderr>: "max_length": 256,
[1,8]<stderr>: "max_position_embeddings": 512
[1,8]<stderr>: },
[1,8]<stderr>: "summarization_xsum": {
[1,8]<stderr>: "length_penalty": 0.8,
[1,8]<stderr>: "max_length": 64,
[1,8]<stderr>: "max_position_embeddings": 512
[1,8]<stderr>: }
[1,8]<stderr>: },
[1,8]<stderr>: "transformers_version": "4.6.1",
[1,8]<stderr>: "use_cache": true,
[1,8]<stderr>: "vocab_size": 96103
[1,8]<stderr>:
[1,8]<stderr>:All the weights of PegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-large.
[1,8]<stderr>:If your task is similar to the task the model of the checkpoint was trained on, you can already use PegasusForConditionalGeneration for predictions without further training.
[... dataset tokenization progress bars from all 16 ranks trimmed ...]
[1,0]<stderr>:All model checkpoint weights were used when initializing PegasusForConditionalGeneration.
[1,0]<stderr>:
[1,0]<stderr>:All the weights of PegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-large.
[1,0]<stderr>:If your task is similar to the task the model of the checkpoint was trained on, you can already use PegasusForConditionalGeneration for predictions without further training.
[1,8]<stderr>:max_steps is given, it will override any value given in num_train_epochs
[1,8]<stderr>:Using amp fp16 backend
[1,8]<stderr>:}
[1,8]<stderr>:
[1,0]<stderr>:loading configuration file https://huggingface.co/google/pegasus-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:Model config PegasusConfig {
[1,0]<stderr>: "_name_or_path": "google/pegasus-large",
[1,0]<stderr>: "activation_dropout": 0.1,
[1,0]<stderr>: "activation_function": "relu",
[1,0]<stderr>: "add_bias_logits": false,
[1,0]<stderr>: "add_final_layer_norm": true,
[1,0]<stderr>: "architectures": [
[1,0]<stderr>: "PegasusForConditionalGeneration"
[1,0]<stderr>: ],
[1,0]<stderr>: "attention_dropout": 0.1,
[1,0]<stderr>: "bos_token_id": 0,
[1,0]<stderr>: "classif_dropout": 0.0,
[1,0]<stderr>: "classifier_dropout": 0.0,
[1,0]<stderr>: "d_model": 1024,
[1,0]<stderr>: "decoder_attention_heads": 16,
[1,0]<stderr>: "decoder_ffn_dim": 4096,
[1,0]<stderr>: "decoder_layerdrop": 0.0,
[1,0]<stderr>: "decoder_layers": 16,
[1,0]<stderr>: "decoder_start_token_id": 0,
[1,0]<stderr>: "dropout": 0.1,
[1,0]<stderr>: "encoder_attention_heads": 16,
[1,0]<stderr>: "encoder_ffn_dim": 4096,
[1,0]<stderr>: "encoder_layerdrop": 0.0,
[1,0]<stderr>: "encoder_layers": 16,
[1,0]<stderr>: "eos_token_id": 1,
[1,0]<stderr>: "extra_pos_embeddings": 1,
[1,0]<stderr>: "force_bos_token_to_be_generated": false,
[1,0]<stderr>: "forced_eos_token_id": 1,
[1,0]<stderr>: "gradient_checkpointing": false,
[1,0]<stderr>: "id2label": {
[1,0]<stderr>:max_steps is given, it will override any value given in num_train_epochs
[1,0]<stderr>:Using amp fp16 backend
[1,0]<stderr>:***** Running training *****
[1,0]<stderr>: Num examples = 1558
[1,0]<stderr>: Num Epochs = 41
[1,0]<stderr>: Instantaneous batch size per device = 2
[1,0]<stderr>: Total train batch size (w. parallel, distributed & accumulation) = 32
[1,0]<stderr>: Gradient Accumulation steps = 1
[1,0]<stderr>: Total optimization steps = 2000
[1,0]<stderr>:#015 0%| | 0/2000 [00:00<?, ?it/s][1,8]<stderr>:***** Running training *****
[1,8]<stderr>: Num examples = 1558
[1,8]<stderr>: Num Epochs = 41
[1,8]<stderr>: Instantaneous batch size per device = 2
[1,8]<stderr>: Total train batch size (w. parallel, distributed & accumulation) = 32
[1,8]<stderr>: Gradient Accumulation steps = 1
[1,0]<stderr>: "0": "LABEL_0",
[1,0]<stderr>: "1": "LABEL_1",
[1,0]<stderr>: "2": "LABEL_2"
[1,0]<stderr>: },
[1,0]<stderr>: "init_std": 0.02,
[1,0]<stderr>: "is_encoder_decoder": true,
[1,0]<stderr>: "label2id": {
[1,0]<stderr>: "LABEL_0": 0,
[1,0]<stderr>: "LABEL_1": 1,
[1,0]<stderr>: "LABEL_2": 2
[1,0]<stderr>: },
[1,0]<stderr>: "length_penalty": 0.8,
[1,0]<stderr>: "max_length": 256,
[1,0]<stderr>: "max_position_embeddings": 1024,
[1,0]<stderr>: "model_type": "pegasus",
[1,0]<stderr>: "normalize_before": true,
[1,0]<stderr>: "normalize_embedding": false,
[1,0]<stderr>: "num_beams": 8,
[1,0]<stderr>: "num_hidden_layers": 16,
[1,0]<stderr>: "pad_token_id": 0,
[1,0]<stderr>: "scale_embedding": true,
[1,0]<stderr>: "static_position_embeddings": true,
[1,0]<stderr>: "task_specific_params": {
[1,0]<stderr>: "summarization_aeslc": {
[1,0]<stderr>: "length_penalty": 0.6,
[1,0]<stderr>: "max_length": 32,
[1,0]<stderr>: "max_position_embeddings": 512
[1,0]<stderr>: },
[1,0]<stderr>: "summarization_arxiv": {
[1,0]<stderr>: "length_penalty": 0.8,
[1,0]<stderr>: "max_length": 256,
[1,8]<stderr>: Total optimization steps = 2000
[1,0]<stderr>:  0%|          | 5/2000 [00:15<1:47:36, 3.24s/it]
[1,8]<stderr>:  0%|          | 5/2000 [00:15<1:47:43, 3.24s/it]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 6 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[algo-1:00046] 13 more processes have sent help message help-mpi-api.txt / mpi-abort
[algo-1:00046] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2021-06-22 21:50:32 Failed - Training job failed
ProfilerReport-1624398094: Stopping
2021-06-22 21:50:42,166 sagemaker-training-toolkit INFO MPI process finished.
2021-06-22 21:50:42,166 sagemaker-training-toolkit INFO Reporting training SUCCESS
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-10-7e1bcc378f37> in <module>
3 {'train': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_train.csv',
4 'test': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_test.csv',
----> 5 'validation': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_validation.csv'}
6 )
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
680 self.jobs.append(self.latest_training_job)
681 if wait:
--> 682 self.latest_training_job.wait(logs=logs)
683
684 def _compilation_job_name(self):
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1623 # If logs are requested, call logs_for_jobs.
1624 if logs != "None":
-> 1625 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1626 else:
1627 self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3694
3695 if wait:
-> 3696 self._check_job_status(job_name, description, "TrainingJobStatus")
3697 if dot:
3698 print()
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3254 ),
3255 allowed_statuses=["Completed", "Stopped"],
-> 3256 actual_status=status,
3257 )
3258
UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-06-22-21-41-34-638: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "mpirun --host algo-1:8,algo-2:8 -np 16 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.p3.16xlarge smddprun /opt/conda/bin/python3.6 -m mpi4py run_summarization_original.py --do_eval True --do_train True --eval_steps 1000 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 2000 --max_target_length 100 --m