Distributed Training run_summarization.py

Hi,
I cannot for the life of me figure out what is going wrong. I am following tutorial 08 (distributed training), and when I send the job to AWS the error below keeps showing up. I am running it from a local Jupyter notebook with the following setup:



[1,0]:storing https://huggingface.co/facebook/bart-large-cnn/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/4ccdf4cdc01b790f9f9c636c7695b5d443180e8dbd0cbe49e07aa918dda1cef0.fa29468c10a34ef7f6cfceba3b174d3ccc95f8d755c3ca1b829aff41cc92a300
[1,0]:creating metadata file for /root/.cache/huggingface/transformers/4ccdf4cdc01b790f9f9c636c7695b5d443180e8dbd0cbe49e07aa918dda1cef0.fa29468c10a34ef7f6cfceba3b174d3ccc95f8d755c3ca1b829aff41cc92a300
[1,5]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,7]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,3]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,0]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,1]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,4]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,6]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set


[1,6]:100%|██████████| 1/1 [00:00<00:00, 8.62ba/s]
[1,7]:Downloading: 5.61kB [00:00, 5.02MB/s]
[1,0]:Downloading: 5.61kB [00:00, 4.54MB/s]
[1,0]:Using amp fp16 backend
[1,0]:***** Running training *****
[1,0]: Num examples = 14732
[1,0]: Num Epochs = 3
[1,0]: Instantaneous batch size per device = 4
[1,0]: Total train batch size (w. parallel, distributed & accumulation) = 32
[1,0]: Gradient Accumulation steps = 1
[1,0]: Total optimization steps = 1383
[1,0]:  0%|          | 0/1383 [00:00<?, ?it/s]
[1,0]:  1%|          | 17/1383 [00:32<43:47, 1.92s/it]
--------------------------------------------------------------------------

MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

2021-07-29 04:41:50 Uploading - Uploading generated training model
2021-07-29 04:41:50 Failed - Training job failed

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-07-29-04-30-08-439: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command “mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so smddprun /opt/conda/bin/python3.6 -m mpi4py run_summarization.py --dataset_name samsum --do_eval True --do_predict True --do_train True --fp16 True --learning_rate 5e-05 --model_name_or_path facebook/bart-large-cnn --num_train_epochs 3 --output_dir /opt/ml/model --per_device_eval_batch_size 4 --per_device_train_batch_size 4 --predict_with_generate True --seed 7”
[1,2]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set

Any help would be greatly appreciated. The CloudWatch logs don’t really have anything else to say. It definitely seems like a problem with the SAGEMAKER_INSTANCE_TYPE environment variable not being set, but I thought I had already set it by specifying instance_type when initializing huggingface_estimator?
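
For context, my estimator setup is essentially the one from the tutorial notebook. A rough reconstruction (the role, source_dir and version pins are placeholders rather than my exact values):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# SageMaker data parallelism, as in the distributed training tutorial
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# These mirror the arguments visible in the failing mpirun command above
hyperparameters = {
    "model_name_or_path": "facebook/bart-large-cnn",
    "dataset_name": "samsum",
    "do_train": True,
    "do_eval": True,
    "do_predict": True,
    "predict_with_generate": True,
    "num_train_epochs": 3,
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "fp16": True,
    "seed": 7,
    "output_dir": "/opt/ml/model",
}

huggingface_estimator = HuggingFace(
    entry_point="run_summarization.py",
    source_dir="./examples/seq2seq",   # placeholder path
    instance_type="ml.p3.16xlarge",    # where I assumed SAGEMAKER_INSTANCE_TYPE comes from
    instance_count=1,
    role=role,
    transformers_version="4.6",        # version pins from the tutorial, may differ
    pytorch_version="1.7",
    py_version="py36",
    distribution=distribution,
    hyperparameters=hyperparameters,
)

huggingface_estimator.fit()
```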

Thanks :slight_smile:

Hey @cdwyer1bod,

Thanks for opening the thread. Happy to help you.
Could you still share the full CloudWatch logs? Sometimes the errors are a bit hidden.

I saw you changed the instance from ml.p3dn.24xlarge to ml.p3.16xlarge and kept the same batch_size; this could be the issue. Could you reduce the batch_size to 2 or change the instance type?

For me it worked with ml.p3.16xlarge and a batch_size of 2.
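
In your hyperparameters that would be something along these lines (sketch; the keys map to the --per_device_* flags in your failing command):

```python
hyperparameters = {
    # ... keep the rest of your hyperparameters as they are ...
    "per_device_train_batch_size": 2,  # was 4
    "per_device_eval_batch_size": 2,   # was 4
}
```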

Hey @philschmid thanks for the solution,

It ended up being a CUDA out-of-memory issue. I lowered the learning rate, bumped the instance count up to 2, set the batch_size to 2, and it worked after a while.
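
For anyone who lands here later, the adjustments relative to my estimator sketch above were roughly the following (the exact lowered learning rate shown here is illustrative, I only know that I reduced it):

```python
# Lower learning rate and smaller per-device batches (hypothetical LR value)
hyperparameters["learning_rate"] = 3e-5               # was 5e-5
hyperparameters["per_device_train_batch_size"] = 2    # was 4
hyperparameters["per_device_eval_batch_size"] = 2     # was 4

huggingface_estimator = HuggingFace(
    entry_point="run_summarization.py",
    source_dir="./examples/seq2seq",   # placeholder, as above
    instance_type="ml.p3.16xlarge",
    instance_count=2,                  # bumped from 1 to 2 instances
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    distribution=distribution,
    hyperparameters=hyperparameters,
)
huggingface_estimator.fit()
```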

Thanks :slight_smile:
