Distributed Training on SageMaker

ujjirox · June 22, 2021, 10:10pm

Hey! Sorry to post again. This error is really over my head. The script works on a single GPU, but after adding the argument for distributed training it fails with some really unusual errors. I essentially just added the distribution argument and changed the instance type and number of instances. Am I missing something here? This is being run with run_summarization.py. I also deleted the parts of the log that weren’t relevant because of the character limit on posts. Do I potentially have to use the wrapper script in the git examples folder that is meant for distributed training? Thanks a lot!

from sagemaker.huggingface import HuggingFace

hyperparameters={
    'model_name_or_path': 'google/pegasus-large',
    'train_file': "/opt/ml/input/data/train/final_aws_deepgram_train.csv",
    'test_file': "/opt/ml/input/data/test/final_aws_deepgram_test.csv",
    'validation_file': "/opt/ml/input/data/validation/final_aws_deepgram_validation.csv",
    'text_column': 'document',
    'summary_column': 'summary',
    'do_train': True,
    'do_eval': True,
    'fp16': True,
    'per_device_train_batch_size': 2,
    'per_device_eval_batch_size': 2,
    'evaluation_strategy': "steps",
    'eval_steps': 1000,
    'weight_decay': 0.01,
    'learning_rate': 2e-5,
    'max_grad_norm': 1,
    'max_steps': 2000,
    'max_source_length': 500,
    'max_target_length': 100,
    'load_best_model_at_end': True,
    'output_dir': '/opt/ml/model'
}

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# instance configurations
instance_type='ml.p3.16xlarge'
instance_count=2
volume_size=200

# estimator
huggingface_estimator = HuggingFace(entry_point='run_summarization_original.py',
                                    source_dir='transformers/examples/pytorch/summarization',
                                    git_config=git_config,
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    volume_size=volume_size,
                                    role=role,
                                    transformers_version='4.6.1',
                                    pytorch_version='1.7.1',
                                    py_version='py36',
                                    distribution=distribution,
                                    hyperparameters=hyperparameters)

2021-06-22 21:50:12,132 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "mpirun --host algo-1:8,algo-2:8 -np 16 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.p3.16xlarge smddprun /opt/conda/bin/python3.6 -m mpi4py run_summarization_original.py --do_eval True --do_train True --eval_steps 1000 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 2000 --max_target_length 100 --model_name_or_path google/pegasus-large --output_dir /opt/ml/model --per_device_eval_batch_size 2 --per_device_train_batch_size 2 --summary_column summary --test_file /opt/ml/input/data/test/final_aws_deepgram_test.csv --text_column document --train_file /opt/ml/input/data/train/final_aws_deepgram_train.csv --validation_file /opt/ml/input/data/validation/final_aws_deepgram_validation.csv --weight_decay 0.01"
Warning: Permanently added 'algo-2,10.2.251.196' (ECDSA) to the list of known hosts.#015
[1,13]<stderr>:#0150 tables [00:00, ? tables/s][1,0]<stderr>:#0150 tables [00:00, ? tables/s][1,13]<stderr>:#0151 tables [00:00,  7.31 tables/s][1,13]<stderr>:#015                                #015[1,13]<stderr>:#0150 tables [00:00, ? tables/s][1,13]<stderr>:#015                            #015[1,13]<stderr>:#0150 tables [00:00, ? tables/s][1,13]<stderr>:#015                            #015[1,0]<stderr>:#0151 tables [00:00,  7.17 tables/s][1,0]<stderr>:#015                                #015[1,0]<stderr>:#0150 tables [00:00, ? tables/s][1,0]<stderr>:#015                            #015[1,0]<stderr>:#0150 tables [00:00, ? tables/s][1,13]<stderr>:#015Downloading:   0%|          | 0.00/3.09k [00:00<?, ?B/s][1,13][00:00<00:00, 3.60MB/s]
[1,0]<stderr>:#015                            #015[1,0]<stderr>:https://huggingface.co/google/pegasus-large/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp2ojn0fqy
[1,8]<stderr>:loading configuration file https://huggingface.co/google/pegasus-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286

[1,0]<stderr>:#015Downloading:   0%|          | 0.00/3.09k [00:00<?, ?B/s][1,0]<stderr>:#015Downloading: 100%|██████████| 3.09k/3.09k [00:00<00:00, 2.57MB/s]
[1,0]<stderr>:storing https://huggingface.co/google/pegasus-large/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:creating metadata file for /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:loading configuration file https://huggingface.co/google/pegasus-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:Model config PegasusConfig {
[1,0]<stderr>:  "_name_or_path": "google/pegasus-large",
[1,0]<stderr>:  "activation_dropout": 0.1,
[1,0]<stderr>:  "activation_function": "relu",
[1,0]<stderr>:  "add_bias_logits": false,
[1,0]<stderr>:  "add_final_layer_norm": true,
[1,0]<stderr>:  "architectures": [
[1,0]<stderr>:    "PegasusForConditionalGeneration"
[1,0]<stderr>:  ],
[1,8]<stderr>:loading weights file https://huggingface.co/google/pegasus-large/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/ef3a8274e003ba4d3ae63f2728378e73affec0029e797c0bbb80be8856130c4f.a99cb24bd92c7087e95d96a1c3eb660b51e498705f8bd068a58c69c20616f514
[1,0]<stderr>:loading weights file https://huggingface.co/google/pegasus-large/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/ef3a8274e003ba4d3ae63f2728378e73affec0029e797c0bbb80be8856130c4f.a99cb24bd92c7087e95d96a1c3eb660b51e498705f8bd068a58c69c20616f514
[1,12]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,15]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,9]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,8]<stderr>:All model checkpoint weights were used when initializing PegasusForConditionalGeneration.
[1,8]<stderr>:      "max_position_embeddings": 1024
[1,8]<stderr>:    },
[1,8]<stderr>:    "summarization_reddit_tifu": {
[1,8]<stderr>:      "length_penalty": 0.6,
[1,8]<stderr>:      "max_length": 128,
[1,8]<stderr>:      "max_position_embeddings": 512
[1,8]<stderr>:    },
[1,8]<stderr>:    "summarization_wikihow": {
[1,8]<stderr>:      "length_penalty": 0.6,
[1,8]<stderr>:      "max_length": 256,
[1,8]<stderr>:      "max_position_embeddings": 512
[1,8]<stderr>:    },
[1,8]<stderr>:    "summarization_xsum": {
[1,8]<stderr>:      "length_penalty": 0.8,
[1,8]<stderr>:      "max_length": 64,
[1,8]<stderr>:      "max_position_embeddings": 512
[1,8]<stderr>:    }
[1,8]<stderr>:  },
[1,8]<stderr>:  "transformers_version": "4.6.1",
[1,8]<stderr>:  "use_cache": true,
[1,8]<stderr>:  "vocab_size": 96103
[1,8]<stderr>:
[1,8]<stderr>:All the weights of PegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-large.
[1,8]<stderr>:If your task is similar to the task the model of the checkpoint was trained on, you can already use PegasusForConditionalGeneration for predictions without further training.
[1,10]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,8]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,11]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,14]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,13]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,12]<stderr>:#015 50%|█████     | 1/2 [00:01<00:01,  1.94s/ba][1,15]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.35s/ba][1,12]<stderr>:#015100%|██████████| 2/2 [00:02<00:00,  1.56s/ba][1,12]<stderr>:#015100%|██████████| 2/2 [00:02<00:00,  1.30s/ba][1,12]<stderr>:
[1,12]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,12]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.28ba/s][1,12]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.27ba/s]
[1,10]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.75s/ba][1,8]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.83s/ba][1,15]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.84s/ba][1,15]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.51s/ba][1,15]<stderr>:
[1,5]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,6]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,15]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,9]<stderr>:#015 50%|█████     | 1/2 [00:03<00:03,  3.09s/ba][1,4]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,3]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,2]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,1]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,7]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,12]<stderr>:#015Downloading:   0%|          | 0.00/2.17k [00:00<?, ?B/s][1,12]<stderr>:#015Downloading: 5.61kB [00:00, 2.18MB/s]                   [1,12]<stderr>:
[1,11]<stderr>:#015 50%|█████     | 1/2 [00:03<00:03,  3.04s/ba][1,0]<stderr>:All model checkpoint weights were used when initializing PegasusForConditionalGeneration.
[1,0]<stderr>:
[1,0]<stderr>:All the weights of PegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-large.
[1,0]<stderr>:If your task is similar to the task the model of the checkpoint was trained on, you can already use PegasusForConditionalGeneration for predictions without further training.
[1,0]<stderr>:#015  0%|          | 0/2 [00:00<?, ?ba/s][1,14]<stderr>:#015 50%|█████     | 1/2 [00:03<00:03,  3.27s/ba][1,15]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  1.96ba/s][1,15]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  1.95ba/s][1,15]<stderr>:
[1,13]<stderr>:#015 50%|█████     | 1/2 [00:03<00:03,  3.48s/ba][1,8]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.24s/ba][1,8]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.85s/ba][1,8]<stderr>:
[1,10]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.22s/ba][1,10]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.87s/ba][1,10]<stderr>:
[1,10]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,8]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,9]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.43s/ba][1,9]<stderr>:#015100%|██████████| 2/2 [00:04<00:00,  2.01s/ba][1,9]<stderr>:
[1,11]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.40s/ba][1,11]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.97s/ba][1,11]<stderr>:
[1,9]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,11]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,10]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.23ba/s][1,10]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.21ba/s][1,10]<stderr>:
[1,8]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.18ba/s][1,8]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.18ba/s]
[1,14]<stderr>:#015100%|██████████| 2/2 [00:04<00:00,  2.55s/ba][1,14]<stderr>:#015100%|██████████| 2/2 [00:04<00:00,  2.07s/ba]
[1,9]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.84ba/s][1,9]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.83ba/s]
[1,14]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,13]<stderr>:#015100%|██████████| 2/2 [00:04<00:00,  2.62s/ba][1,13]<stderr>:#015100%|██████████| 2/2 [00:04<00:00,  2.05s/ba][1,13]<stderr>:
[1,11]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.78ba/s][1,11]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.75ba/s][1,11]<stderr>:
[1,13]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,14]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.90ba/s][1,14]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.89ba/s]
[1,13]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  5.18ba/s][1,13]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  5.17ba/s]
[1,5]<stderr>:#015 50%|█████     | 1/2 [00:01<00:01,  1.97s/ba][1,8]<stderr>:max_steps is given, it will override any value given in num_train_epochs
[1,8]<stderr>:Using amp fp16 backend
[1,5]<stderr>:#015100%|██████████| 2/2 [00:02<00:00,  1.56s/ba][1,5]<stderr>:#015100%|██████████| 2/2 [00:02<00:00,  1.28s/ba]
[1,6]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.55s/ba][1,5]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,2]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.70s/ba][1,1]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.79s/ba][1,3]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.81s/ba][1,5]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  2.81ba/s][1,5]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  2.81ba/s][1,5]<stderr>:
[1,4]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.87s/ba][1,7]<stderr>:#015 50%|█████     | 1/2 [00:02<00:02,  2.85s/ba][1,5]<stderr>:#015Downloading:   0%|          | 0.00/2.17k [00:00<?, ?B/s][1,5]<stderr>:#015Downloading: 5.61kB [00:00, 1.62MB/s]                   
[1,6]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.06s/ba][1,6]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.74s/ba]
[1,2]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.11s/ba][1,2]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.72s/ba]
[1,6]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,2]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,4]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.23s/ba][1,4]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.80s/ba][1,4]<stderr>:
[1,1]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.19s/ba][1,1]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.79s/ba][1,1]<stderr>:
[1,7]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.22s/ba][1,7]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.79s/ba][1,7]<stderr>:
[1,4]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,3]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  2.21s/ba][1,3]<stderr>:#015100%|██████████| 2/2 [00:03<00:00,  1.82s/ba][1,1]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,3]<stderr>:
[1,6]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.08ba/s][1,6]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  4.07ba/s][1,6]<stderr>:
[1,3]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,7]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,0]<stderr>:#015 50%|█████     | 1/2 [00:03<00:03,  3.68s/ba][1,2]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.76ba/s][1,2]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.75ba/s][1,2]<stderr>:
[1,4]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.01ba/s][1,4]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  2.88ba/s][1,4]<stderr>:
[1,7]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.09ba/s][1,7]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  3.08ba/s]
[1,1]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  2.31ba/s][1,1]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  2.27ba/s]
[1,3]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  2.84ba/s][1,3]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  2.84ba/s]
[1,0]<stderr>:#015100%|██████████| 2/2 [00:04<00:00,  2.73s/ba][1,0]<stderr>:#015100%|██████████| 2/2 [00:04<00:00,  2.09s/ba]
[1,8]<stderr>:}
[1,8]<stderr>:
[1,0]<stderr>:loading configuration file https://huggingface.co/google/pegasus-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3fa0446657dd3714a950ba400a3fa72686d0f815da436514e4823a973ef23e20.7a0cb161a6d34c3881891b70d4fa06557175ac7b704a19bf0100fb9c21af9286
[1,0]<stderr>:Model config PegasusConfig {
[1,0]<stderr>:  "_name_or_path": "google/pegasus-large",
[1,0]<stderr>:  "activation_dropout": 0.1,
[1,0]<stderr>:  "activation_function": "relu",
[1,0]<stderr>:  "add_bias_logits": false,
[1,0]<stderr>:  "add_final_layer_norm": true,
[1,0]<stderr>:  "architectures": [
[1,0]<stderr>:    "PegasusForConditionalGeneration"
[1,0]<stderr>:  ],
[1,0]<stderr>:  "attention_dropout": 0.1,
[1,0]<stderr>:  "bos_token_id": 0,
[1,0]<stderr>:  "classif_dropout": 0.0,
[1,0]<stderr>:  "classifier_dropout": 0.0,
[1,0]<stderr>:  "d_model": 1024,
[1,0]<stderr>:  "decoder_attention_heads": 16,
[1,0]<stderr>:  "decoder_ffn_dim": 4096,
[1,0]<stderr>:  "decoder_layerdrop": 0.0,
[1,0]<stderr>:  "decoder_layers": 16,
[1,0]<stderr>:  "decoder_start_token_id": 0,
[1,0]<stderr>:  "dropout": 0.1,
[1,0]<stderr>:  "encoder_attention_heads": 16,
[1,0]<stderr>:  "encoder_ffn_dim": 4096,
[1,0]<stderr>:  "encoder_layerdrop": 0.0,
[1,0]<stderr>:  "encoder_layers": 16,
[1,0]<stderr>:  "eos_token_id": 1,
[1,0]<stderr>:  "extra_pos_embeddings": 1,
[1,0]<stderr>:  "force_bos_token_to_be_generated": false,
[1,0]<stderr>:  "forced_eos_token_id": 1,
[1,0]<stderr>:  "gradient_checkpointing": false,
[1,0]<stderr>:  "id2label": {
[1,0]<stderr>:#015  0%|          | 0/1 [00:00<?, ?ba/s][1,0]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  5.17ba/s][1,0]<stderr>:#015100%|██████████| 1/1 [00:00<00:00,  5.16ba/s]
[1,0]<stderr>:max_steps is given, it will override any value given in num_train_epochs
[1,0]<stderr>:Using amp fp16 backend
[1,0]<stderr>:***** Running training *****
[1,0]<stderr>:  Num examples = 1558
[1,0]<stderr>:  Num Epochs = 41
[1,0]<stderr>:  Instantaneous batch size per device = 2
[1,0]<stderr>:  Total train batch size (w. parallel, distributed & accumulation) = 32
[1,0]<stderr>:  Gradient Accumulation steps = 1
[1,0]<stderr>:  Total optimization steps = 2000
[1,0]<stderr>:#015  0%|          | 0/2000 [00:00<?, ?it/s][1,8]<stderr>:***** Running training *****
[1,8]<stderr>:  Num examples = 1558
[1,8]<stderr>:  Num Epochs = 41
[1,8]<stderr>:  Instantaneous batch size per device = 2
[1,8]<stderr>:  Total train batch size (w. parallel, distributed & accumulation) = 32
[1,8]<stderr>:  Gradient Accumulation steps = 1
[1,0]<stderr>:    "0": "LABEL_0",
[1,0]<stderr>:    "1": "LABEL_1",
[1,0]<stderr>:    "2": "LABEL_2"
[1,0]<stderr>:  },
[1,0]<stderr>:  "init_std": 0.02,
[1,0]<stderr>:  "is_encoder_decoder": true,
[1,0]<stderr>:  "label2id": {
[1,0]<stderr>:    "LABEL_0": 0,
[1,0]<stderr>:    "LABEL_1": 1,
[1,0]<stderr>:    "LABEL_2": 2
[1,0]<stderr>:  },
[1,0]<stderr>:  "length_penalty": 0.8,
[1,0]<stderr>:  "max_length": 256,
[1,0]<stderr>:  "max_position_embeddings": 1024,
[1,0]<stderr>:  "model_type": "pegasus",
[1,0]<stderr>:  "normalize_before": true,
[1,0]<stderr>:  "normalize_embedding": false,
[1,0]<stderr>:  "num_beams": 8,
[1,0]<stderr>:  "num_hidden_layers": 16,
[1,0]<stderr>:  "pad_token_id": 0,
[1,0]<stderr>:  "scale_embedding": true,
[1,0]<stderr>:  "static_position_embeddings": true,
[1,0]<stderr>:  "task_specific_params": {
[1,0]<stderr>:    "summarization_aeslc": {
[1,0]<stderr>:      "length_penalty": 0.6,
[1,0]<stderr>:      "max_length": 32,
[1,0]<stderr>:      "max_position_embeddings": 512
[1,0]<stderr>:    },
[1,0]<stderr>:    "summarization_arxiv": {
[1,0]<stderr>:      "length_penalty": 0.8,
[1,0]<stderr>:      "max_length": 256,
[1,8]<stderr>:  Total optimization steps = 2000
[1,8]<stderr>:#015  0%|          | 0/2000 [00:00<?, ?it/s][1,8]<stderr>:#015  0%|          | 1/2000 [00:07<4:03:57,  7.32s/it][1,0]<stderr>:#015  0%|          | 1/2000 [00:07<4:07:45,  7.44s/it][1,8]<stderr>:#015  0%|          | 2/2000 [00:09<3:17:17,  5.92s/it][1,0]<stderr>:#015  0%|          | 2/2000 [00:10<3:19:39,  6.00s/it][1,0]<stderr>:#015  0%|          | 3/2000 [00:11<2:33:21,  4.61s/it][1,8]<stderr>:#015  0%|          | 3/2000 [00:11<2:33:05,  4.60s/it][1,0]<stderr>:#015  0%|          | 4/2000 [00:12<2:01:18,  3.65s/it][1,8]<stderr>:#015  0%|          | 4/2000 [00:12<2:01:26,  3.65s/it][1,0]<stderr>:#015  0%|          | 5/2000 [00:15<1:47:36,  3.24s/it][1,8]<stderr>:#015  0%|          | 5/2000 [00:15<1:47:43,  3.24s/it]--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 6 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[algo-1:00046] 13 more processes have sent help message help-mpi-api.txt / mpi-abort
[algo-1:00046] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


2021-06-22 21:50:32 Failed - Training job failed
ProfilerReport-1624398094: Stopping
2021-06-22 21:50:42,166 sagemaker-training-toolkit INFO     MPI process finished.
2021-06-22 21:50:42,166 sagemaker-training-toolkit INFO     Reporting training SUCCESS
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-10-7e1bcc378f37> in <module>
      3   {'train': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_train.csv',
      4    'test': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_test.csv',
----> 5   'validation': 's3://qfn-transcription/ujjawal_files/final_aws_deepgram_validation.csv'}
      6 )

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    680         self.jobs.append(self.latest_training_job)
    681         if wait:
--> 682             self.latest_training_job.wait(logs=logs)
    683 
    684     def _compilation_job_name(self):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1623         # If logs are requested, call logs_for_jobs.
   1624         if logs != "None":
-> 1625             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1626         else:
   1627             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3694 
   3695         if wait:
-> 3696             self._check_job_status(job_name, description, "TrainingJobStatus")
   3697             if dot:
   3698                 print()

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3254                 ),
   3255                 allowed_statuses=["Completed", "Stopped"],
-> 3256                 actual_status=status,
   3257             )
   3258 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-06-22-21-41-34-638: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "mpirun --host algo-1:8,algo-2:8 -np 16 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.p3.16xlarge smddprun /opt/conda/bin/python3.6 -m mpi4py run_summarization_original.py --do_eval True --do_train True --eval_steps 1000 --evaluation_strategy steps --fp16 True --learning_rate 2e-05 --load_best_model_at_end True --max_grad_norm 1 --max_source_length 500 --max_steps 2000 --max_target_length 100 --m

philschmid · June 23, 2021, 6:45am

Hey @ujjirox,

Could you maybe upload the logs as files? When running distributed training, it sometimes happens that the real error is far above the exit message.
Without seeing the full error it is hard to say, but your batch size might be too big. When scaling up from p3.2xlarge to p3.16xlarge (same GPUs), SageMaker might use more of the GPU memory for the distribution.

OlivierCR · June 23, 2021, 7:42am

@ujjirox don’t be sorry! It’s a pleasure to get activity on the forum and interact with users. Please post as much as needed!

ujjirox · June 23, 2021, 2:28pm

Hey guys. Here it is. Thanks for your help! The batch size did occur to me but a batch size of 2 seems pretty small and I know for sure that this works on the single Tesla V100 GPU. The max length on the document is also limited to just 500.

Unfortunately, only image files are accepted. I put the log onto a txt file if that helps.

Thanks!
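
On the batch-size point: the log above reports a total train batch size of 32 and 41 epochs, which is what a per-device batch of 2 across 16 processes (2 instances x 8 GPUs) would give. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the epoch formula is an assumption about how the Trainer converts max_steps into epochs, not something stated in this thread. Each GPU still only has to fit its own per-device batch, so the total of 32 does not by itself explain an out-of-memory error.

```python
import math


def total_train_batch_size(per_device, num_processes, grad_accum=1):
    # Effective batch per optimizer step across all data-parallel workers.
    return per_device * num_processes * grad_accum


def epochs_for_max_steps(num_examples, total_batch, max_steps):
    # Assumed rule: steps per epoch is rounded up, then the number of
    # epochs is rounded up so that max_steps can be reached.
    steps_per_epoch = math.ceil(num_examples / total_batch)
    return math.ceil(max_steps / steps_per_epoch)


# Values taken from the log: 2 per device, 2 instances x 8 GPUs,
# 1558 examples, max_steps = 2000.
total = total_train_batch_size(per_device=2, num_processes=2 * 8)
print(total)
print(epochs_for_max_steps(num_examples=1558, total_batch=total, max_steps=2000))
```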

philschmid · June 23, 2021, 2:36pm

Thanks for uploading the logs. As suspected, you are getting a CUDA out of memory error, and it looks like you are only slightly over the limit. You can find several of them when you search for “CUDA out of memory” in the log.

[1,3]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 52.00 MiB (GPU 3; 15.78 GiB total capacity; 14.40 GiB already allocated; 49.75 MiB free; 14.47 GiB reserved in total by PyTorch)
[1,8]<stdout>:Traceback (most recent call last):
[1,8]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,8]<stdout>:    "__main__", mod_spec)
[1,8]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,8]<stdout>:    exec(code, run_globals)
[1,8]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,8]<stdout>:    main()
[1,8]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,8]<stdout>:    run_command_line(args)

You could try the bigger p3 instance, ml.p3dn.24xlarge, or maybe just decrease per_device_eval_batch_size to 1. From the first log you showed, it seems training does start, so maybe you get the OOM error during eval.
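
Since the real error can sit far above the final MPI_ABORT message, searching the log programmatically can help. The following is a rough sketch, not an official SageMaker or PyTorch tool: the regular expression is written against the exact message format quoted above, and the function name find_oom_errors is hypothetical.

```python
import re

# Matches mpirun-tagged lines such as
# "[1,3]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate ..."
OOM_PATTERN = re.compile(
    r"\[\d+,(?P<rank>\d+)\]<std(?:out|err)>:.*CUDA out of memory\. "
    r"Tried to allocate (?P<requested>[\d.]+ [GM]iB) "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+ [GM]iB) total capacity; "
    r"(?P<allocated>[\d.]+ [GM]iB) already allocated; "
    r"(?P<free>[\d.]+ [GM]iB) free"
)


def find_oom_errors(lines):
    """Return one dict per 'CUDA out of memory' line found in the log."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hit = match.groupdict()
            hit["rank"] = int(hit["rank"])
            hit["gpu"] = int(hit["gpu"])
            hits.append(hit)
    return hits


sample = [
    "[1,8]<stderr>:Using amp fp16 backend",
    "[1,3]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate "
    "52.00 MiB (GPU 3; 15.78 GiB total capacity; 14.40 GiB already "
    "allocated; 49.75 MiB free; 14.47 GiB reserved in total by PyTorch)",
]
for hit in find_oom_errors(sample):
    print(hit)
```

To scan a saved log, pass an open file object instead of the sample list; the function only needs an iterable of lines.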

ujjirox · June 23, 2021, 4:16pm

Thanks for your reply! I tried running it with a batch size of 1 for both training and eval on the p3.16xlarge, but even that errored out. Could there be something else going on? Thanks!

hyperparameters={
    'model_name_or_path': 'google/pegasus-large',
    'train_file': "/opt/ml/input/data/train/final_aws_deepgram_train.csv",
    'test_file': "/opt/ml/input/data/test/final_aws_deepgram_test.csv",
    'validation_file': "/opt/ml/input/data/validation/final_aws_deepgram_validation.csv",
    'text_column': 'document',
    'summary_column': 'summary',
    'do_train': True,
    'do_eval': True,
    'per_device_train_batch_size': 1,
    'per_device_eval_batch_size': 1,
    'evaluation_strategy': "steps",
    'eval_steps': 1000,
    'learning_rate': 2e-5,
    'max_steps': 2000,
    'max_source_length': 500,
    'max_target_length': 100,
    'load_best_model_at_end': True,
    'output_dir': '/opt/ml/model'
}

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# instance configurations
instance_type='ml.p3.16xlarge'
instance_count=1
volume_size=200

Just to add to that actually, I ran the t5/bart_summarization notebook that is in your repo. With exactly the same configuration, that failed for the same reason with CUDA out of memory.

notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks (github.com)

philschmid · June 25, 2021, 12:54pm

I can confirm that the t5/bart_summarization notebook works as is on the ml.p3dn.24xlarge. I remember that someone tried to run it on the ml.p3.16xlarge and needed to decrease the batch_size too.

The logs you attached still show the CUDA error:

[1,2]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 2; 15.78 GiB total capacity; 13.72 GiB already allocated; 249.75 MiB free; 14.04 GiB reserved in total by PyTorch)

Can you downsample your dataset a bit and try again? Or try a few steps on an ml.p3dn.24xlarge.
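
One way to downsample a CSV training file for a quick experiment is sketched below, using only the standard library. The function name downsample_csv and the file names in the usage note are hypothetical, and the code assumes the file has a header row (as the 'document' / 'summary' columns in the hyperparameters suggest).

```python
import csv
import random


def downsample_csv(src_path, dst_path, fraction, seed=0):
    """Write a random subset of the rows in src_path to dst_path.

    The header row is kept, and the number of data rows written is returned.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Keep at least one row, but never more than the file contains.
    keep = min(len(rows), max(1, int(len(rows) * fraction)))
    sampled = random.Random(seed).sample(rows, keep)

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sampled)
    return keep
```

For example, downsample_csv("train.csv", "train_small.csv", 0.25) would keep roughly a quarter of the rows; the smaller file then has to be uploaded to S3 and passed to fit() in place of the original.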

ujjirox · June 28, 2021, 1:35am

Hey, Sorry for the late reply. I haven’t been able to run it on the ml.p3dn.24xlarge because of certain permissions on my AWS account, but I have been able to make my training work on a single GPU. Thanks for all the help though! Will reach out if distributed training is a must.

Jorgeutd · July 16, 2021, 6:28pm

Hi HF/SageMaker team! I keep getting the following error when I run the run_summarization.py example. Is there a specific instance type that I should use, and why? Thank you.

Error log end:

***** Running training *****
Num examples = 15549
Num Epochs = 10
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 4860
#015 0%| | 0/4860 [00:00<?, ?it/s]THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=139 error=711 : peer mapping resources exhausted
Traceback (most recent call last):
  File "run_summarization.py", line 606, in <module>
    main()
  File "run_summarization.py", line 530, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1272, in train
    tr_loss += self.training_step(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1732, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1766, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 756, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 37, in scatter_kwargs
    kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    res = scatter_map(inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 92, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/comm.py", line 186, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: cuda runtime error (711) : peer mapping resources exhausted at …/aten/src/THC/THCGeneral.cpp:139
#015 0%| | 0/4860 [00:13<?, ?it/s]
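
As a rough sanity check, the header of the log above can be used to back out how many GPUs the job was spread over. The sketch below assumes the usual Trainer relation (total batch size = per-device batch size × number of devices × gradient accumulation steps); the variable names are illustrative and not taken from the script.

```python
import math

# Values copied from the log header above.
num_examples = 15549
num_epochs = 10
per_device_batch_size = 2
total_train_batch_size = 32
gradient_accumulation_steps = 1

# Assumed relation: total = per_device * num_devices * accumulation.
num_devices = total_train_batch_size // (
    per_device_batch_size * gradient_accumulation_steps
)

# Optimizer steps per epoch: one per (possibly partial) global batch.
steps_per_epoch = math.ceil(num_examples / total_train_batch_size)
total_steps = steps_per_epoch * num_epochs

print(num_devices, steps_per_epoch, total_steps)  # prints: 16 486 4860
```

The result matches the 4860 optimization steps in the log and implies 16 devices. The traceback goes through torch/nn/parallel/data_parallel.py, so those devices were apparently driven from a single process. Whether that is what triggers error 711 is not something the log alone confirms.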

philschmid · July 19, 2021, 7:08am

Hey @Jorgeutd,

Could you share some more information about what you are doing? In particular, could you share:

- your hyperparameters
- the model
- the approx. size of your dataset
- the instance type you went with so far

Could you also share the full logs? I can see that it’s a CUDA runtime error, but I’m not sure whether it’s due to memory.

Jorgeutd · July 21, 2021, 8:25pm

Phillip the superhero. I was able to make it work. Thank you. I increased the volume size and decreased the batch-size hyperparameters (‘per_device_train_batch_size’: 2, ‘per_device_eval_batch_size’: 2).
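
A minimal sketch of what that change might look like in the estimator setup. The batch sizes are the ones quoted above; the volume size of 200 GB and the dictionary name estimator_kwargs are assumptions for illustration, since the post does not give the actual value.

```python
# Hyperparameters forwarded to run_summarization.py (values quoted in the post).
hyperparameters = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
}

# Keyword arguments for the HuggingFace estimator.
# volume_size is the size in GB of the storage volume attached to each
# training instance; 200 is a hypothetical value, not the one actually used.
estimator_kwargs = {
    "volume_size": 200,
    "hyperparameters": hyperparameters,
}

# These would then be passed along with the other arguments, e.g.
# HuggingFace(entry_point=..., role=..., **estimator_kwargs)
```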

Thanks. I am now having issues trying to train a causal LM (text generation), so I opened a separate topic: “ValueError: Source directory does not exist in the repo. Training causal lm in sagemaker”. I do not think you guys have done a demo or notebook for this task. I reviewed run_clm.py and it looks fine.

Thank you Phillip.

JacquesThibs · August 5, 2021, 1:33am

I’m running into the same issue with text classification (ValueError: Source directory does not exist in the repo.). Anyone know how to fix this?

Here’s my code:

import sagemaker
from sagemaker.huggingface import HuggingFace

# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'output_dir': '/opt/ml/model',
    'dataset_name': 'imdb',
    'do_train': True,
    'do_eval': True,
    'per_device_train_batch_size': 12,
    'num_train_epochs': 5,
    'max_seq_length': 128,
    'fp16': True,
    'pad_to_max_length': True,
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.4.2'}

# configuration for running training on smdistributed Data Parallel
# smdistributed = SageMaker Distributed
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_glue.py',
    source_dir='./examples/pytorch/text-classification/',
    instance_type='ml.p3dn.24xlarge',  # has 8 GPUs
    instance_count=2,  # changed to 2 instances
    role=role,
    git_config=git_config,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    hyperparameters=hyperparameters
)

# starting the train job
huggingface_estimator.fit()

JacquesThibs · August 5, 2021, 2:04am

Never mind, I forgot to check the directory structure of the transformers repo, and I think I also needed to update to the latest version of the SageMaker SDK (pip install -U sagemaker).

Here’s my new code:

import sagemaker
from sagemaker.huggingface import HuggingFace

# gets role for executing training job
role = sagemaker.get_execution_role()
hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'output_dir': '/opt/ml/model',
    'dataset_name': 'imdb',
    'do_train': True,
    'do_eval': True,
    'per_device_train_batch_size': 12,
    'num_train_epochs': 5,
    'max_seq_length': 128,
    'fp16': True,
    'pad_to_max_length': True,
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# configuration for running training on smdistributed Data Parallel
# smdistributed = SageMaker Distributed
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_glue.py',
    source_dir='./examples/pytorch/text-classification',
    instance_type='ml.p3dn.24xlarge',  # has 8 GPUs
    instance_count=2,  # changed to 2 instances
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    distribution=distribution  # pass the config defined above; otherwise it is never used
)

# starting the train job
huggingface_estimator.fit()
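
The change above moves the git branch and transformers_version together, so one way to keep them from drifting apart is to derive both from a single version string. This is only a sketch: versioned_config is a hypothetical helper, and it assumes release branches or tags are named 'v' plus the version number, as the two used in this thread are.

```python
def versioned_config(transformers_version: str) -> dict:
    """Build matching git_config and transformers_version entries.

    Hypothetical helper: assumes release branches/tags are named
    'v<version>', e.g. 'v4.6.1'.
    """
    return {
        "git_config": {
            "repo": "https://github.com/huggingface/transformers.git",
            "branch": f"v{transformers_version}",
        },
        "transformers_version": transformers_version,
    }


config = versioned_config("4.6.1")
print(config["git_config"]["branch"])  # prints: v4.6.1
```

The resulting dictionary could then be unpacked into the estimator call with **config instead of passing the two arguments separately.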

philschmid · August 5, 2021, 6:59am

Hey @JacquesThibs,

I am glad that you could solve your issue.
Yes, with version 4.6 the examples/ scripts were restructured.
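
To make the restructuring concrete, here is a small sketch that picks the source_dir for the text-classification example from the transformers version. The helper name is hypothetical, and the pre-4.6 path (examples/text-classification) is an assumption based on the older repo layout, so it is worth checking against the actual branch.

```python
def text_classification_source_dir(transformers_version: str) -> str:
    """Return the assumed source_dir of run_glue.py for a given release."""
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    if (major, minor) >= (4, 6):
        # Layout after the restructuring mentioned above.
        return "./examples/pytorch/text-classification"
    # Assumed layout before the restructuring.
    return "./examples/text-classification"


print(text_classification_source_dir("4.4.2"))
print(text_classification_source_dir("4.6.1"))
```

Under that assumption, branch v4.4.2 would need the first path, which would explain why './examples/pytorch/text-classification/' was reported as missing there.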
