Hi everyone,
For quite a long time I've been struggling with a weird issue regarding distributed training/evaluation.
I'm running a slightly modified version of the run_clm.py script with a varying number of A100 GPUs (4-8) on a single node, and I keep getting a ChildFailedError right after training/evaluation ends.
I'm running GPT-2 (the smallest model) on the OpenWebText dataset.
I launch my code as follows:
torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=NUM_GPU \
    run_clm.py \
    --model_name_or_path {MODEL} \
    --dataset_name {DS_NAME} \
    --preprocessing_num_workers 16 \
    --logging_steps 5000 \
    --save_steps {SAVE_STEPS} \
    --do_eval \
    --per_device_eval_batch_size {EVAL_BATCH} \
    --seed {RANDOM} \
    --evaluation_strategy steps \
    --logging_dir {OUTPUT_DIR} \
    --output_dir {OUTPUT_DIR} \
    --overwrite_output_dir \
    --ddp_timeout 324000 \
    --ddp_find_unused_parameters False \
    --report_to wandb \
    --max_eval_samples {MAX_EVAL_SAMPLES} \
    --run_name openwebtext_inference
And I get the following error:
warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2600 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 4 (pid: 2601) of binary: /venv/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_12:56:39
host : december-ds-2h4b6-5hpkj
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 2601)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2601
============================================================
My whole log is as follows:
12/21/2022 12:08:21 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=324000,
debug=,
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=5000,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=,
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5000,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=openwebtext_inference,
save_on_each_node=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=10366,
sharded_ddp=,
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
12/21/2022 12:08:21 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
[INFO|configuration_utils.py:654] 2022-12-21 12:08:21,429 >> loading configuration file config.json from cache at /store/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-21 12:08:21,430 >> Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}
[INFO|tokenization_auto.py:449] 2022-12-21 12:08:21,756 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:654] 2022-12-21 12:08:22,061 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-21 12:08:22,062 >> Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file vocab.json from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file merges.txt from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file tokenizer.json from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file tokenizer_config.json from cache at None
[INFO|configuration_utils.py:654] 2022-12-21 12:08:22,761 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-21 12:08:22,761 >> Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}
[INFO|modeling_utils.py:2204] 2022-12-21 12:08:26,315 >> loading weights file pytorch_model.bin from cache at /store/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
[INFO|modeling_utils.py:2708] 2022-12-21 12:08:32,476 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:2716] 2022-12-21 12:08:32,476 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
12/21/2022 12:27:56 - INFO - main - *** Evaluate ***
[INFO|trainer.py:703] 2022-12-21 12:27:56,951 >> The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `GPT2LMHeadModel.forward`, you can safely ignore this message.
[INFO|trainer.py:2944] 2022-12-21 12:27:56,954 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-21 12:27:56,954 >> Num examples = 20000
[INFO|trainer.py:2949] 2022-12-21 12:27:56,954 >> Batch size = 8
0%| | 0/500 [00:00<?, ?it/s]
…
…
…
100%|██████████| 500/500 [08:53<00:00, 1.07s/it]
==========================================================
warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2600 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 4 (pid: 2601) of binary: /venv/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_12:56:39
host : december-ds-2h4b6-5hpkj
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 2601)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2601
I'm also trying to add debug flags and inject assertions in several potentially relevant places to pinpoint exactly where it crashes, and I will update with more information later on.
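For reference, this is the kind of change I mean, a minimal sketch rather than my exact code: wrapping run_clm.py's main() with the elastic error recorder, so that a crashing rank should write a proper error file instead of the "error_file: <N/A>" shown above.

# Sketch only: decorate run_clm.py's main() with torch.distributed.elastic's error recorder.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # original run_clm.py main body, unchanged

if __name__ == "__main__":
    main()

I also plan to rerun with more verbose distributed logging (e.g. NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL) in case that surfaces anything.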
Versions (according to pip freeze --local):
absl-py==1.0.0
accelerate==0.15.0
aiohttp==3.8.3
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.1.0
blessed==1.19.1
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.1.3
datasets==2.7.1
deepspeed==0.6.0
dill==0.3.6
docker-pycreds==0.4.0
evaluate==0.4.0
fairscale==0.4.6
filelock==3.8.2
frozenlist==1.3.3
fsspec==2022.11.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.6.0
google-auth-oauthlib==0.4.6
gpustat==1.0.0
grpcio==1.44.0
hjson==3.0.2
huggingface-hub==0.11.1
idna==3.3
importlib-metadata==4.11.2
joblib==1.2.0
Markdown==3.3.6
model-compression-research @ file:///
multidict==6.0.3
multiprocess==0.70.14
ninja==1.10.2.3
nltk==3.8
numpy==1.22.3
nvidia-ml-py==11.495.46
oauthlib==3.2.0
packaging==21.3
pandas==1.5.2
pathtools==0.1.2
Pillow==9.0.1
pkg_resources==0.0.0
promise==2.3
protobuf==3.19.4
psutil==5.9.0
py-cpuinfo==8.0.0
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.7
PyYAML==6.0
regex==2022.10.31
requests==2.27.1
requests-oauthlib==1.3.1
responses==0.18.0
rsa==4.8
scikit-learn==1.2.0
scipy==1.9.3
sentencepiece==0.1.97
sentry-sdk==1.12.0
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
sklearn==0.0.post1
smmap==5.0.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.13.2
torch==1.10.2+cu113
torchaudio==0.10.2+cu113
torchvision==0.11.3+cu113
tqdm==4.63.0
transformers==4.25.1
typing_extensions==4.1.1
urllib3==1.26.13
wandb==0.13.7
wcwidth==0.2.5
Werkzeug==2.0.3
xxhash==3.1.0
yarl==1.8.2
zipp==3.7.0
Notes:
- The error occurs both in training and in evaluation.
- To rule out a timeout, I deliberately set a very high ddp_timeout value.
- I tried running with both torchrun and torch.distributed.launch and hit the same issue.
- The number of samples in my training/eval set doesn't matter; the issue remains.
- I track my memory usage (roughly as in the sketch below) and OOM is not the case here (kinda wish it was).
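For context, this is roughly how I track memory per rank, a minimal sketch rather than my exact code (log_memory is just an illustrative helper; psutil is already in the freeze above), and the readings stay well below the limits:

# Sketch of per-rank memory logging (hypothetical helper, not part of run_clm.py).
import os

import psutil
import torch

def log_memory(tag=""):
    rank = int(os.environ.get("RANK", "0"))
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 2**30
    avail_gib = psutil.virtual_memory().available / 2**30
    gpu_alloc_gib = torch.cuda.memory_allocated() / 2**30
    gpu_reserved_gib = torch.cuda.memory_reserved() / 2**30
    print(
        f"[rank {rank}] {tag} host RSS={rss_gib:.2f} GiB, "
        f"host available={avail_gib:.2f} GiB, "
        f"GPU allocated={gpu_alloc_gib:.2f} GiB, "
        f"GPU reserved={gpu_reserved_gib:.2f} GiB",
        flush=True,
    )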
Would really appreciate any help on this issue!