torch.distributed.elastic.multiprocessing.errors.ChildFailedError

IdoAmit198 · December 21, 2022, 6:45pm

Hi everyone,
For quite long time I’m struggling with some weird issue regards distributed train/eval.

I’m running a slightly modified version of run_clm.py script with vary number of A100 GPUs (4-8) on 1 node, and keep getting the ChildFailedError right after the training/evaluation ends.
I’m running GPT2 (smallest model) on the OpenWebText dataset

I’m running my code as follow:

torchrun
–standalone
–nnodes=1
–nproc_per_node=NUM_GPU
run_clm.py
–model_name_or_path {MODEL}
–dataset_name {DS_NAME}
–preprocessing_num_workers 16
–logging_steps 5000
–save_steps {SAVE_STEPS}
–do_eval
–per_device_eval_batch_size {EVAL_BATCH}
–seed {RANDOM}
–evaluation_strategy steps
–logging_dir {OUTPUT_DIR}
–output_dir {OUTPUT_DIR}
–overwrite_output_dir
–ddp_timeout 324000
–ddp_find_unused_parameters False
–report_to wandb
–max_eval_samples {MAX_EVAL_SAMPLES}
–run_name openwebtext_inference \

And getting the follow error:

warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2600 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 4 (pid: 2601) of binary: /venv/bin/python
Traceback (most recent call last):
File “/usr/lib/python3.8/runpy.py”, line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File “/usr/lib/python3.8/runpy.py”, line 87, in _run_code
exec(code, run_globals)
File “/venv/lib/python3.8/site-packages/torch/distributed/launch.py”, line 193, in
main()
File “/venv/lib/python3.8/site-packages/torch/distributed/launch.py”, line 189, in main
launch(args)
File “/venv/lib/python3.8/site-packages/torch/distributed/launch.py”, line 174, in launch
run(args)
File “/venv/lib/python3.8/site-packages/torch/distributed/run.py”, line 710, in run
elastic_launch(
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_12:56:39
host : december-ds-2h4b6-5hpkj
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 2601)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2601
============================================================

My whole log is as follow:

12/21/2022 12:08:21 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 12:08:21 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=324000,
debug=,
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=5000,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=,
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5000,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[‘wandb’],
resume_from_checkpoint=None,
run_name=openwebtext_inference,
save_on_each_node=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=10366,
sharded_ddp=,
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
12/21/2022 12:08:21 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
[INFO|configuration_utils.py:654] 2022-12-21 12:08:21,429 >> loading configuration file config.json from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-21 12:08:21,430 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|tokenization_auto.py:449] 2022-12-21 12:08:21,756 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:654] 2022-12-21 12:08:22,061 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-21 12:08:22,062 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file vocab.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file merges.txt from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file tokenizer.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 12:08:22,754 >> loading file tokenizer_config.json from cache at None
[INFO|configuration_utils.py:654] 2022-12-21 12:08:22,761 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-21 12:08:22,761 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|modeling_utils.py:2204] 2022-12-21 12:08:26,315 >> loading weights file pytorch_model.bin from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
[INFO|modeling_utils.py:2708] 2022-12-21 12:08:32,476 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:2716] 2022-12-21 12:08:32,476 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
12/21/2022 12:27:56 - INFO - main - *** Evaluate ***
[INFO|trainer.py:703] 2022-12-21 12:27:56,951 >> The following columns in the evaluation set don’t have a corresponding argument in GPT2LMHeadModel.forward and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by GPT2LMHeadModel.forward, you can safely ignore this message.
[INFO|trainer.py:2944] 2022-12-21 12:27:56,954 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-21 12:27:56,954 >> Num examples = 20000
[INFO|trainer.py:2949] 2022-12-21 12:27:56,954 >> Batch size = 8
0%| | 0/500 [00:00<?, ?it/s]
…
…
…
100%|██████████| 500/500 [08:53<00:00, 1.07s/it]
==========================================================
warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2600 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 4 (pid: 2601) of binary: /venv/bin/python
Traceback (most recent call last):
File “/usr/lib/python3.8/runpy.py”, line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File “/usr/lib/python3.8/runpy.py”, line 87, in _run_code
exec(code, run_globals)
File “/venv/lib/python3.8/site-packages/torch/distributed/launch.py”, line 193, in
main()
File “/venv/lib/python3.8/site-packages/torch/distributed/launch.py”, line 189, in main
launch(args)
File “/venv/lib/python3.8/site-packages/torch/distributed/launch.py”, line 174, in launch
run(args)
File “/venv/lib/python3.8/site-packages/torch/distributed/run.py”, line 710, in run
elastic_launch(
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_12:56:39
host : december-ds-2h4b6-5hpkj
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 2601)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2601

I’m also trying to add debug flags and inject assertion in different potentially places to verify the exact location it crashes, and will update it more information later on.

Versions (according to `pip freeze --local`):

absl-py==1.0.0
accelerate==0.15.0
aiohttp==3.8.3
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.1.0
blessed==1.19.1
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.1.3
datasets==2.7.1
deepspeed==0.6.0
dill==0.3.6
docker-pycreds==0.4.0
evaluate==0.4.0
fairscale==0.4.6
filelock==3.8.2
frozenlist==1.3.3
fsspec==2022.11.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.6.0
google-auth-oauthlib==0.4.6
gpustat==1.0.0
grpcio==1.44.0
hjson==3.0.2
huggingface-hub==0.11.1
idna==3.3
importlib-metadata==4.11.2
joblib==1.2.0
Markdown==3.3.6
model-compression-research @ file:///
multidict==6.0.3
multiprocess==0.70.14
ninja==1.10.2.3
nltk==3.8
numpy==1.22.3
nvidia-ml-py==11.495.46
oauthlib==3.2.0
packaging==21.3
pandas==1.5.2
pathtools==0.1.2
Pillow==9.0.1
pkg_resources==0.0.0
promise==2.3
protobuf==3.19.4
psutil==5.9.0
py-cpuinfo==8.0.0
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.7
PyYAML==6.0
regex==2022.10.31
requests==2.27.1
requests-oauthlib==1.3.1
responses==0.18.0
rsa==4.8
scikit-learn==1.2.0
scipy==1.9.3
sentencepiece==0.1.97
sentry-sdk==1.12.0
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
sklearn==0.0.post1
smmap==5.0.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.13.2
torch==1.10.2+cu113
torchaudio==0.10.2+cu113
torchvision==0.11.3+cu113
tqdm==4.63.0
transformers==4.25.1
typing_extensions==4.1.1
urllib3==1.26.13
wandb==0.13.7
wcwidth==0.2.5
Werkzeug==2.0.3
xxhash==3.1.0
yarl==1.8.2
zipp==3.7.0

Notes:

The error occurs in training and in evaluation.
In order to eliminate the option of timeout I deliberately fixed very high value for timeout.
I tried to run using torchrun and using torch.distributed.launch and faced the same issue.
The number of samples in my training/eval doesn’t affect and the issue remain.
I track my memory usage and OOM is not the case here (kinda wish it was).

Would really appreciate any help on this issue!

IdoAmit198 · December 21, 2022, 10:10pm

Hey guys, I’m adding the log when running with the debugging flags:

export TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO

log:

12/21/2022 20:37:39 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=324000,
debug=,
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=5000,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=,
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5000,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[‘wandb’],
resume_from_checkpoint=None,
run_name=openwebtext_inference,
save_on_each_node=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=6298,
sharded_ddp=,
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
12/21/2022 20:37:39 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
[INFO|configuration_utils.py:654] 2022-12-21 20:37:40,050 >> loading configuration file config.json from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file vocab.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file merges.txt from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file tokenizer.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file tokenizer_config.json from cache at None
[INFO|configuration_utils.py:654] 2022-12-21 20:37:41,351 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|modeling_utils.py:2204] 2022-12-21 20:37:44,384 >> loading weights file pytorch_model.bin from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
[INFO|modeling_utils.py:2708] 2022-12-21 20:37:50,669 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
======================================================
[INFO|modeling_utils.py:2716] 2022-12-21 20:37:50,669 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4041:4041 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4042:4042 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4044:4044 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
======================================================
december-ds-2h4b6-5hpkj:4045:4045 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
======================================================
december-ds-2h4b6-5hpkj:4043:4043 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Setting affinity for GPU 4 to ffff,fff00000,00ffffff,f0000000
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Setting affinity for GPU 5 to ffff,fff00000,00ffffff,f0000000
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 00 : 2[3e000] → 3[88000] via direct shared memory
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 00 : 4[89000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 01 : 2[3e000] → 3[88000] via direct shared memory
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 01 : 4[89000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 00 : 0[1b000] → 1[3d000] via direct shared memory
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 00 : 1[3d000] → 2[3e000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 01 : 0[1b000] → 1[3d000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 00 : 3[88000] → 4[89000] via P2P/IPC
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 01 : 1[3d000] → 2[3e000] via P2P/IPC
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 01 : 3[88000] → 4[89000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 00 : 4[89000] → 3[88000] via P2P/IPC
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 01 : 4[89000] → 3[88000] via P2P/IPC
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 00 : 1[3d000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 01 : 1[3d000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 00 : 3[88000] → 2[3e000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 01 : 3[88000] → 2[3e000] via direct shared memory
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 00 : 2[3e000] → 1[3d000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 01 : 2[3e000] → 1[3d000] via P2P/IPC
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO comm 0x7f97e8002fb0 rank 1 nranks 5 cudaDev 1 busId 3d000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO comm 0x7f5d80002fb0 rank 2 nranks 5 cudaDev 2 busId 3e000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO comm 0x7fee20002fb0 rank 3 nranks 5 cudaDev 3 busId 88000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO comm 0x7f21a4002fb0 rank 0 nranks 5 cudaDev 0 busId 1b000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO comm 0x7f7078002fb0 rank 4 nranks 5 cudaDev 4 busId 89000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Launch mode Parallel
12/21/2022 20:56:29 - INFO - main - *** Evaluate ***
[INFO|trainer.py:703] 2022-12-21 20:56:29,129 >> The following columns in the evaluation set don’t have a corresponding argument in GPT2LMHeadModel.forward and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by GPT2LMHeadModel.forward, you can safely ignore this message.
[INFO|trainer.py:2944] 2022-12-21 20:56:29,133 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-21 20:56:29,133 >> Num examples = 88340
[INFO|trainer.py:2949] 2022-12-21 20:56:29,133 >> Batch size = 8
0%| | 0/2209 [00:00<?, ?it/s]
0%| | 2/2209 [00:00<18:23, 2.00it/s]
0%| | 3/2209 [00:02<26:56, 1.36it/s]
0%| | 4/2209 [00:03<31:30, 1.17it/s]
…
…
100%|██████████| 2209/2209 [39:03<00:00, 1.06s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4041 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4042 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4043 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4045 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 4044) of binary: /venv/bin/python3
Traceback (most recent call last):
File “/venv/bin/torchrun”, line 8, in
sys.exit(main())
File “/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py”, line 345, in wrapper
return f(*args, **kwargs)
File “/venv/lib/python3.8/site-packages/torch/distributed/run.py”, line 719, in main
run(args)
File “/venv/lib/python3.8/site-packages/torch/distributed/run.py”, line 710, in run
elastic_launch(
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_21:57:29
host : december-ds-2h4b6-5hpkj
rank : 3 (local_rank: 3)
exitcode : -9 (pid: 4044)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 4044
============================================================

Hope it helps and somebody can share of his knowledge about it

IdoAmit198 · December 23, 2022, 6:13am

Hey everyone,
Anyone might know something about?

IdoAmit198 · December 26, 2022, 8:04am

Hey guys,

Tried to upgrade my torch version to 1.12.1+cu113.
The error remains, but there are other details in the warnings which I still don’t understand and may be connected to my issue:

[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:29400.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:29400 on [localhost]:51966.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:29400 on [localhost]:51980.
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:40069.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43714.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43722.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43734.
[I ProcessGroupNCCL.cpp:587] [Rank 4] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 4] NCCL watchdog thread started!
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43726.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43742.
[I ProcessGroupNCCL.cpp:751] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43754.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43758.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43770.
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43776.
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43778.
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
12/24/2022 17:09:12 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=324000,
debug=,
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=5000,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=,
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/gpt2/distributed_inference,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5000,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/gpt2/distributed_inference,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[‘wandb’],
resume_from_checkpoint=None,
run_name=openwebtext-distributed-inference,
save_on_each_node=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=2520,
sharded_ddp=,
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
[INFO|configuration_utils.py:654] 2022-12-24 17:09:12,874 >> loading configuration file config.json from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-24 17:09:12,875 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|tokenization_auto.py:449] 2022-12-24 17:09:13,190 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:654] 2022-12-24 17:09:13,508 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-24 17:09:13,509 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file vocab.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file merges.txt from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file tokenizer.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file tokenizer_config.json from cache at None
[INFO|configuration_utils.py:654] 2022-12-24 17:09:14,133 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-24 17:09:14,134 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|modeling_utils.py:2204] 2022-12-24 17:09:17,271 >> loading weights file pytorch_model.bin from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
[I ProcessGroupNCCL.cpp:2012] Rank 4 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I ProcessGroupNCCL.cpp:2012] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I ProcessGroupNCCL.cpp:2012] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[INFO|modeling_utils.py:2708] 2022-12-24 17:09:19,025 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:2716] 2022-12-24 17:09:19,026 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
[I ProcessGroupNCCL.cpp:2012] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I ProcessGroupNCCL.cpp:2012] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:282:282 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:286:286 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:283:283 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:285:285 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:284:284 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Setting affinity for GPU 4 to ffff,fff00000,00ffffff,f0000000
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Setting affinity for GPU 5 to ffff,fff00000,00ffffff,f0000000
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 00 : 2[3e000] → 3[88000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 00 : 4[89000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 01 : 2[3e000] → 3[88000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 01 : 4[89000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 00 : 1[3d000] → 2[3e000] via P2P/IPC
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 01 : 1[3d000] → 2[3e000] via P2P/IPC
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 00 : 3[88000] → 4[89000] via P2P/IPC
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 01 : 3[88000] → 4[89000] via P2P/IPC
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 00 : 0[1b000] → 1[3d000] via direct shared memory
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 01 : 0[1b000] → 1[3d000] via direct shared memory
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 00 : 3[88000] → 2[3e000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 01 : 3[88000] → 2[3e000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 00 : 4[89000] → 3[88000] via P2P/IPC
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 01 : 4[89000] → 3[88000] via P2P/IPC
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 00 : 1[3d000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 01 : 1[3d000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 00 : 2[3e000] → 1[3d000] via P2P/IPC
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 01 : 2[3e000] → 1[3d000] via P2P/IPC
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO comm 0x7fc5c80030d0 rank 3 nranks 5 cudaDev 3 busId 88000 - Init COMPLETE
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO comm 0x7f75e80030d0 rank 1 nranks 5 cudaDev 1 busId 3d000 - Init COMPLETE
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO comm 0x7f6f700030d0 rank 2 nranks 5 cudaDev 2 busId 3e000 - Init COMPLETE
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO comm 0x7f150c0030d0 rank 4 nranks 5 cudaDev 4 busId 89000 - Init COMPLETE
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO comm 0x7f53ec0030d0 rank 0 nranks 5 cudaDev 0 busId 1b000 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO Launch mode Parallel
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
12/24/2022 17:12:10 - INFO - main - *** Evaluate ***
[INFO|trainer.py:703] 2022-12-24 17:12:10,400 >> The following columns in the evaluation set don’t have a corresponding argument in GPT2LMHeadModel.forward and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by GPT2LMHeadModel.forward, you can safely ignore this message.
[INFO|trainer.py:2944] 2022-12-24 17:12:10,405 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-24 17:12:10,405 >> Num examples = 88340
[INFO|trainer.py:2949] 2022-12-24 17:12:10,405 >> Batch size = 4
0%| | 0/4417 [00:00<?, ?it/s]
0%| | 2/4417 [00:00<21:13, 3.47it/s]
…
…
100%|█████████▉| 4416/4417 [43:27<00:00, 1.69it/s]
100%|██████████| 4417/4417 [43:28<00:00, 1.69it/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 282 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 284 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 286 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 283) of binary: /venv/bin/python3
Traceback (most recent call last):
File “/venv/bin/torchrun”, line 8, in
sys.exit(main())
File “/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py”, line 345, in wrapper
return f(*args, **kwargs)
File “/venv/lib/python3.8/site-packages/torch/distributed/run.py”, line 761, in main
run(args)
File “/venv/lib/python3.8/site-packages/torch/distributed/run.py”, line 752, in run
elastic_launch(
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py”, line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-24_17:58:51
host : distributed-4dc7b-wvh75
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 283)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 283
============================================================

Hope anyone can help me to locate the issue

ekjot1999 · January 12, 2023, 5:41pm

hi @IdoAmit198 , i’m facing the same issue, have u resolved this issue?

jxm · March 23, 2023, 2:49am

I had this problem too, no easy fix.

IdoAmit198 · March 26, 2023, 6:35am

Hey guys, I’m glad to announce I solved the issue on my side.
As can be seen I use multiple GPUs, which have sufficient memory for the use case.
HOWEVER! My issue was due to not enough CPU memory. That’s why my runs crashed and without any trace of the reason.
Once I allocated enough cpu (on my case I increased it from 32GB to 96+ GB).

If the CPU allocation is constant and you can not allocated more, I’m sure you can try solutions as compressed models, deepspeed optimization levels and more.

Good luck to future readers.

elleza · April 23, 2023, 9:28am

I have same problem.
Is it mean you added DDR4 memory?

IdoAmit198 · April 23, 2023, 11:48am

I added RAM memory.

elleza · April 23, 2023, 2:00pm

I understand. Thank you.

AnustupOCR · July 26, 2023, 8:04am

Hi, I have been facing the same Issue, In my case, I am fine tuning TrOCR model to work on other languages, wherein i have swapped out the encoder and decoder.

The code runs perfectly fine on single GPU, but when i tried to train the model over multiple GPU’s with accelerate library, i am facing the exact same error.

The error occurs randomly, mostly after the validation end and next epoch starts. I am not being able to figure out the exact reason.

I am running the program in a docker and i checked the stats, memory(RAM) should not be an issue as 1.8TB is free in my case.

I am recieving the exact same error with the same exitcode : -9.

Please help me with any possible solutions.

shane131214 · August 2, 2023, 4:24am

I have the same issue. I solve this by increasing the CPU memory.

AnustupOCR · August 2, 2023, 9:02am

Yeah, my issue got resolved after specifying the docker memory and properly specifying the device id’s with CUDA_VISIBLE_DEVICES=…

thedaffodil · September 11, 2023, 6:39pm

hello which command did you use? I’m trying to run this https://github.com/snap-stanford/med-flamingo/blob/master/scripts/demo.py file on 2 gpu, but got same error with exit code:-9

rajsabi · September 13, 2023, 12:10pm

i am also facing the same error
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
i am trying to fine tune trocr on bangla language with 23 lakh images. trying to fine tune on 4 gpus. facing this error after 1 epoch.
How can I increase cpu memory in my case?
will increasing gpu numbers and decreasing batch_size will help?
NAME=“Ubuntu”
VERSION=“20.04.2 LTS (Focal Fossa)”
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME=“Ubuntu 20.04.2 LTS”
VERSION_ID=“20.04”
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

any help…
thanks!

IdoAmit198 · September 26, 2023, 8:36am

Since you’re working in ubonto environment you can actually monitor your CPU & GPU usage quite easily. I would suggest you to try the following:

Read about screen/tmux commands on how to split the terminal to panes so each pane would monitor one of the specs. Also look at gpustat in order to monitor gpu usage in real time (I usually use the command as gpustat --watch -n 1). Lastly, read about the command htop to monitor CPU usage in your machine.
Split your terminal to 3 panes with tmux/screen - 1)the actual coding area ; 2) to monitor the GPU usage with gpustat ; 3) to monitor CPU usage using htop command.
Run your code and monitor the gpu/cpu consumption.

Feel free to report your observations and we can think together about the cause to your training crash and how to figure it out

rajsabi · September 27, 2023, 5:49am

Thanks for reply @IdoAmit198 . I will try this too. Actually my code worked. I found out that the transformer version (4.32.1) I was using, was not supporting some preprocessing on gray scale images. Now it is working on version=4.24.0.

appledev1234 · March 30, 2024, 7:13am

Just use 4.26.1 version of transformers:
pip install git+https://github.com/huggingface/transformers.git@v4.26.1

shirubei · December 9, 2024, 11:36am

In my case, I changed pytorch version from 2.3.0 to 2.0.0 with same transformers version and no error any more.

christineac · January 22, 2025, 11:06am

@shirubei what’s your version of transformers?

Topic		Replies	Views
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 🤗Accelerate	1	702	August 15, 2024
Error when fine-tuning on multi-gpu 🤗Transformers	1	873	February 17, 2025
Run crash with all GPU's and success with less 🤗Transformers	0	429	December 12, 2022
RuntimeError: arguments are located on different GPUs 🤗Transformers	2	1870	October 24, 2020
Errors when training on multi node single gpu 🤗Transformers	1	1786	February 25, 2022