Distributed Training run_summarization.py

Hi,
I cannot for the life of me figure out what is going wrong. I am following tutorial 08 (distributed training), and when I send the job to AWS the error below keeps showing up. I am running it from a local Jupyter notebook with the following setup:



[1,0]:storing https://huggingface.co/facebook/bart-large-cnn/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/4ccdf4cdc01b790f9f9c636c7695b5d443180e8dbd0cbe49e07aa918dda1cef0.fa29468c10a34ef7f6cfceba3b174d3ccc95f8d755c3ca1b829aff41cc92a300
[1,0]:creating metadata file for /root/.cache/huggingface/transformers/4ccdf4cdc01b790f9f9c636c7695b5d443180e8dbd0cbe49e07aa918dda1cef0.fa29468c10a34ef7f6cfceba3b174d3ccc95f8d755c3ca1b829aff41cc92a300
[1,5]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,7]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,3]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,0]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,1]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,4]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,6]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set


[1,6]:100%|██████████| 1/1 [00:00<00:00, 8.62ba/s]
[1,7]:Downloading: 5.61kB [00:00, 5.02MB/s]
[1,0]:Downloading: 5.61kB [00:00, 4.54MB/s]
[1,0]:Using amp fp16 backend
[1,0]:***** Running training *****
[1,0]: Num examples = 14732
[1,0]: Num Epochs = 3
[1,0]: Instantaneous batch size per device = 4
[1,0]: Total train batch size (w. parallel, distributed & accumulation) = 32
[1,0]: Gradient Accumulation steps = 1
[1,0]: Total optimization steps = 1383
[1,0]:  0%|          | 0/1383 [00:00<?, ?it/s]
[1,0]:  1%|          | 17/1383 [00:32<43:47, 1.92s/it]
--------------------------------------------------------------------------

MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

2021-07-29 04:41:50 Uploading - Uploading generated training model
2021-07-29 04:41:50 Failed - Training job failed

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-07-29-04-30-08-439: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command “mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so smddprun /opt/conda/bin/python3.6 -m mpi4py run_summarization.py --dataset_name samsum --do_eval True --do_predict True --do_train True --fp16 True --learning_rate 5e-05 --model_name_or_path facebook/bart-large-cnn --num_train_epochs 3 --output_dir /opt/ml/model --per_device_eval_batch_size 4 --per_device_train_batch_size 4 --predict_with_generate True --seed 7”
[1,2]:Environment variable SAGEMAKER_INSTANCE_TYPE is not set

Any help would be greatly appreciated. The CloudWatch logs don’t really have anything else to say. It definitely seems like a problem with the SAGEMAKER_INSTANCE_TYPE environment variable not being set, but I thought I had already set it by specifying instance_type when initializing huggingface_estimator?
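
For context, my estimator setup is essentially the one from the tutorial notebook. A rough reconstruction (the role, source_dir and version pins are placeholders rather than my exact values):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# SageMaker data parallelism, as in the distributed training tutorial
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# These mirror the arguments visible in the failing mpirun command above
hyperparameters = {
    "model_name_or_path": "facebook/bart-large-cnn",
    "dataset_name": "samsum",
    "do_train": True,
    "do_eval": True,
    "do_predict": True,
    "predict_with_generate": True,
    "num_train_epochs": 3,
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "fp16": True,
    "seed": 7,
    "output_dir": "/opt/ml/model",
}

huggingface_estimator = HuggingFace(
    entry_point="run_summarization.py",
    source_dir="./examples/seq2seq",   # placeholder path
    instance_type="ml.p3.16xlarge",    # where I assumed SAGEMAKER_INSTANCE_TYPE comes from
    instance_count=1,
    role=role,
    transformers_version="4.6",        # version pins from the tutorial, may differ
    pytorch_version="1.7",
    py_version="py36",
    distribution=distribution,
    hyperparameters=hyperparameters,
)

huggingface_estimator.fit()
```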

Thanks :slight_smile:

Hey @cdwyer1bod,

Thanks for opening the thread. Happy to help you.
Could you still share the full CloudWatch logs? Sometimes the errors are a bit hidden.

I saw you changed the instance from ml.p3dn.24xlarge to ml.p3.16xlarge and kept the same batch_size; this could be the issue. Could you reduce the batch_size to 2 or change the instance type?

For me it worked with ml.p3.16xlarge and a batch_size of 2.
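
In your hyperparameters that would be something along these lines (sketch; the keys map to the --per_device_* flags in your failing command):

```python
hyperparameters = {
    # ... keep the rest of your hyperparameters as they are ...
    "per_device_train_batch_size": 2,  # was 4
    "per_device_eval_batch_size": 2,   # was 4
}
```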

Hey @philschmid thanks for the solution,

It ended up being a CUDA out-of-memory issue. I lowered the learning rate, bumped the instance count up to 2, set the batch_size to 2, and it worked after a while.
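
For anyone who lands here later, the adjustments relative to my estimator sketch above were roughly the following (the exact lowered learning rate shown here is illustrative, I only know that I reduced it):

```python
# Lower learning rate and smaller per-device batches (hypothetical LR value)
hyperparameters["learning_rate"] = 3e-5               # was 5e-5
hyperparameters["per_device_train_batch_size"] = 2    # was 4
hyperparameters["per_device_eval_batch_size"] = 2     # was 4

huggingface_estimator = HuggingFace(
    entry_point="run_summarization.py",
    source_dir="./examples/seq2seq",   # placeholder, as above
    instance_type="ml.p3.16xlarge",
    instance_count=2,                  # bumped from 1 to 2 instances
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    distribution=distribution,
    hyperparameters=hyperparameters,
)
huggingface_estimator.fit()
```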

Thanks :slight_smile:
