KeyError: 'backend' ChildFailedError codeparrot_training.py FAILED

ishaansharma · May 15, 2023, 5:43am

I am running the code from the official repository of the book, and it fails on my system with the errors below. Could anyone take a look and point me in the right direction?
The same error occurs when I run the nlp_example.py file from the accelerate examples repository.

Steps to reproduce the behavior:

git clone https://huggingface.co/transformersbook/codeparrot
cd codeparrot
pip install -r requirements.txt
wandb login
accelerate config
accelerate launch codeparrot_training.py

✦ 🕙 11:41:06 ❯ accelerate launch codeparrot_training.py
2023-05-10 11:45:58.950271: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on[::ffff:127.0.0.1]:44970.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on[::ffff:127.0.0.1]:44986.
2023-05-10 11:46:03.976884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2023-05-10 11:46:03.993856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
╭────────────────────────── Traceback (most recent call last) ──────────────────────────╮
│ /mnt/ssd2/tf_gpu_docker/ground0/git_repo/codeparrot/codeparrot_training.py:115 in     │
│ <module>                                                                              │
│                                                                                       │
│   112 │   return loss.item(), perplexity.item()                                       │
│   113                                                                                 │
│   114 # Accelerator                                                                   │
│ ❱ 115 accelerator = Accelerator(dispatch_batches=True)                                │
│   116 acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()}     │
│   117 # Hyperparameters                                                               │
│   118 project_name = 'transformersbook/codeparrot'                                    │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/accelerator. │
│ py:358 in __init__                                                                    │
│                                                                                       │
│    355 │   │   │   │   │   │   self.fp8_recipe_handler = handler                      │
│    356 │   │                                                                          │
│    357 │   │   kwargs = self.init_handler.to_kwargs() if self.init_handler is not Non │
│ ❱  358 │   │   self.state = AcceleratorState(                                         │
│    359 │   │   │   mixed_precision=mixed_precision,                                   │
│    360 │   │   │   cpu=cpu,                                                           │
│    361 │   │   │   dynamo_plugin=dynamo_plugin,                                       │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py:535 │
│ in __init__                                                                           │
│                                                                                       │
│   532 │   │   if parse_flag_from_env("ACCELERATE_USE_CPU"):                           │
│   533 │   │   │   cpu = True                                                          │
│   534 │   │   if PartialState._shared_state == {}:                                    │
│ ❱ 535 │   │   │   PartialState(cpu, **kwargs)                                         │
│   536 │   │   self.__dict__.update(PartialState._shared_state)                        │
│   537 │   │   self._check_initialized(mixed_precision, cpu)                           │
│   538 │   │   if not self.initialized:                                                │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py:130 │
│ in __init__                                                                           │
│                                                                                       │
│   127 │   │   │   elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:       │
│   128 │   │   │   │   self.distributed_type = DistributedType.MULTI_GPU               │
│   129 │   │   │   │   if not torch.distributed.is_initialized():                      │
│ ❱ 130 │   │   │   │   │   self.backend = kwargs.pop("backend")                        │
│   131 │   │   │   │   │   torch.distributed.init_process_group(backend=self.backend,  │
│   132 │   │   │   │   self.num_processes = torch.distributed.get_world_size()         │
│   133 │   │   │   │   self.process_index = torch.distributed.get_rank()               │
╰───────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'backend'
╭────────────────────────── Traceback (most recent call last) ──────────────────────────╮
│ /mnt/ssd2/tf_gpu_docker/ground0/git_repo/codeparrot/codeparrot_training.py:115 in     │
│ <module>                                                                              │
│                                                                                       │
│   112 │   return loss.item(), perplexity.item()                                       │
│   113                                                                                 │
│   114 # Accelerator                                                                   │
│ ❱ 115 accelerator = Accelerator(dispatch_batches=True)                                │
│   116 acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()}     │
│   117 # Hyperparameters                                                               │
│   118 project_name = 'transformersbook/codeparrot'                                    │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/accelerator. │
│ py:358 in __init__                                                                    │
│                                                                                       │
│    355 │   │   │   │   │   │   self.fp8_recipe_handler = handler                      │
│    356 │   │                                                                          │
│    357 │   │   kwargs = self.init_handler.to_kwargs() if self.init_handler is not Non │
│ ❱  358 │   │   self.state = AcceleratorState(                                         │
│    359 │   │   │   mixed_precision=mixed_precision,                                   │
│    360 │   │   │   cpu=cpu,                                                           │
│    361 │   │   │   dynamo_plugin=dynamo_plugin,                                       │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py:535 │
│ in __init__                                                                           │
│                                                                                       │
│   532 │   │   if parse_flag_from_env("ACCELERATE_USE_CPU"):                           │
│   533 │   │   │   cpu = True                                                          │
│   534 │   │   if PartialState._shared_state == {}:                                    │
│ ❱ 535 │   │   │   PartialState(cpu, **kwargs)                                         │
│   536 │   │   self.__dict__.update(PartialState._shared_state)                        │
│   537 │   │   self._check_initialized(mixed_precision, cpu)                           │
│   538 │   │   if not self.initialized:                                                │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py:130 │
│ in __init__                                                                           │
│                                                                                       │
│   127 │   │   │   elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:       │
│   128 │   │   │   │   self.distributed_type = DistributedType.MULTI_GPU               │
│   129 │   │   │   │   if not torch.distributed.is_initialized():                      │
│ ❱ 130 │   │   │   │   │   self.backend = kwargs.pop("backend")                        │
│   131 │   │   │   │   │   torch.distributed.init_process_group(backend=self.backend,  │
│   132 │   │   │   │   self.num_processes = torch.distributed.get_world_size()         │
│   133 │   │   │   │   self.process_index = torch.distributed.get_rank()               │
╰───────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'backend'
[11:46:07] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 103878) of        api.py:672
                    binary: /home/anaconda3/envs/lab/bin/python
╭────────────────────────── Traceback (most recent call last) ──────────────────────────╮
│ /home/anaconda3/envs/lab/bin/accelerate:8 in <module>                        │
│                                                                                       │
│   5 from accelerate.commands.accelerate_cli import main                               │
│   6 if __name__ == '__main__':                                                        │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])              │
│ ❱ 8 │   sys.exit(main())                                                              │
│   9                                                                                   │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/acc │
│ elerate_cli.py:45 in main                                                             │
│                                                                                       │
│   42 │   │   exit(1)                                                                  │
│   43 │                                                                                │
│   44 │   # Run                                                                        │
│ ❱ 45 │   args.func(args)                                                              │
│   46                                                                                  │
│   47                                                                                  │
│   48 if __name__ == "__main__":                                                       │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/lau │
│ nch.py:909 in launch_command                                                          │
│                                                                                       │
│   906 │   elif args.use_megatron_lm and not args.cpu:                                 │
│   907 │   │   multi_gpu_launcher(args)                                                │
│   908 │   elif args.multi_gpu and not args.cpu:                                       │
│ ❱ 909 │   │   multi_gpu_launcher(args)                                                │
│   910 │   elif args.tpu and not args.cpu:                                             │
│   911 │   │   if args.tpu_use_cluster:                                                │
│   912 │   │   │   tpu_pod_launcher(args)                                              │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/lau │
│ nch.py:604 in multi_gpu_launcher                                                      │
│                                                                                       │
│   601 │   )                                                                           │
│   602 │   with patch_environment(**current_env):                                      │
│   603 │   │   try:                                                                    │
│ ❱ 604 │   │   │   distrib_run.run(args)                                               │
│   605 │   │   except Exception:                                                       │
│   606 │   │   │   if is_rich_available() and debug:                                   │
│   607 │   │   │   │   console = get_console()                                         │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/run.p │
│ y:785 in run                                                                          │
│                                                                                       │
│   782 │   │   )                                                                       │
│   783 │                                                                               │
│   784 │   config, cmd, cmd_args = config_from_args(args)                              │
│ ❱ 785 │   elastic_launch(                                                             │
│   786 │   │   config=config,                                                          │
│   787 │   │   entrypoint=cmd,                                                         │
│   788 │   )(*cmd_args)                                                                │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launc │
│ her/api.py:134 in __call__                                                            │
│                                                                                       │
│   131 │   │   self._entrypoint = entrypoint                                           │
│   132 │                                                                               │
│   133 │   def __call__(self, *args):                                                  │
│ ❱ 134 │   │   return launch_agent(self._config, self._entrypoint, list(args))         │
│   135                                                                                 │
│   136                                                                                 │
│   137 def _get_entrypoint_name(                                                       │
│                                                                                       │
│ /home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launc │
│ her/api.py:250 in launch_agent                                                        │
│                                                                                       │
│   247 │   │   │   # if the error files for the failed children exist                  │
│   248 │   │   │   # @record will copy the first error (root cause)                    │
│   249 │   │   │   # to the error file of the launcher process.                        │
│ ❱ 250 │   │   │   raise ChildFailedError(                                             │
│   251 │   │   │   │   name=entrypoint_name,                                           │
│   252 │   │   │   │   failures=result.failures,                                       │
│   253 │   │   │   )                                                                   │
╰───────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError:
============================================================
codeparrot_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-05-10_11:46:07
  host      : YODA
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 103879)
  error_file: <N/A>
  traceback : To enable traceback see:
https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-10_11:46:07
  host      : YODA
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 103878)
  error_file: <N/A>
  traceback : To enable traceback see:
https://pytorch.org/docs/stable/elastic/errors.html
============================================================
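
Reading the traceback above: line 130 of state.py calls kwargs.pop("backend") without a default, so it fails whenever the kwargs dict that reaches PartialState has no "backend" key. Below is a plain-Python sketch of that failure mode only. It is not accelerate's code: the helper names and the "nccl" default are hypothetical and only illustrate the difference between popping with and without a default. Why the key is missing on this system is not shown in the logs. One possibility worth checking is a bug in the installed accelerate release, in which case upgrading accelerate might help.

```python
# Sketch of the failure mode seen in the traceback (not accelerate's code).

def pick_backend_strict(kwargs):
    # Same pattern as the failing line: no default is supplied,
    # so a missing "backend" key raises KeyError.
    return kwargs.pop("backend")


def pick_backend_with_default(kwargs, default="nccl"):
    # Hypothetical defensive variant: fall back to a default backend.
    # "nccl" is an assumption here, chosen because it is the usual
    # torch.distributed backend for multi-GPU CUDA training.
    return kwargs.pop("backend", default)


try:
    pick_backend_strict({})
except KeyError as err:
    print("KeyError:", err)  # prints: KeyError: 'backend'

print(pick_backend_with_default({}))                     # nccl
print(pick_backend_with_default({"backend": "gloo"}))    # gloo
```

The first call reproduces the message in the log with an empty dict, which is consistent with no backend setting being passed through from Accelerator(dispatch_batches=True).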

My accelerate config file:

✦ 🕙 12:13:56 ✖  cat /home/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '[0,1]'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

My GPU details:

✦ 🕙 12:14:05 ❯ nvidia-smi
Wed May 10 12:15:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
| 36%   32C    P8     1W / 250W |     10MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:42:00.0  On |                  N/A |
| 36%   36C    P8    17W / 250W |    418MiB / 11264MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

My CPU Details:

✦2 🕙 11:11:09 ❯ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       1
NUMA node(s):                    4
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           8
Model name:                      AMD Ryzen Threadripper 2970WX 24-Core Processor
Stepping:                        2
Frequency boost:                 enabled
CPU MHz:                         2514.475
CPU max MHz:                     3000.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        5988.41
Virtualization:                  AMD-V
L1d cache:                       768 KiB
L1i cache:                       1.5 MiB
L2 cache:                        12 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-5,24-29
NUMA node1 CPU(s):               12-17,36-41
NUMA node2 CPU(s):               6-11,30-35
NUMA node3 CPU(s):               18-23,42-47
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT vulnera
                                 ble
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
                                 ia prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user
                                  pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP
                                  disabled, RSB filling, PBRSB-eIBRS Not affecte
                                 d
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush mmx fxsr sse s
                                 se2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtsc
                                 p lm constant_tsc rep_good nopl nonstop_tsc cpu
                                 id extd_apicid amd_dcm aperfmperf rapl pni pclm
                                 ulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movb
                                 e popcnt aes xsave avx f16c rdrand lahf_lm cmp_
                                 legacy svm extapic cr8_legacy abm sse4a misalig
                                 nsse 3dnowprefetch osvw skinit wdt tce topoext 
                                 perfctr_core perfctr_nb bpext perfctr_llc mwait
                                 x cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1
                                  avx2 smep bmi2 rdseed adx smap clflushopt sha_
                                 ni xsaveopt xsavec xgetbv1 xsaves clzero irperf
                                  xsaveerptr arat npt lbrv svm_lock nrip_save ts
                                 c_scale vmcb_clean flushbyasid decodeassists pa
                                 usefilter pfthreshold avic v_vmsave_vmload vgif
                                  overflow_recov succor smca sme sev sev_es
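
The details above do not include the installed package versions, which usually matter for an error raised inside accelerate itself. A minimal sketch for collecting them follows, assuming the packages were installed with pip or conda into the same environment that runs the script. The function name installed_versions and the package list are my own choices for illustration. The accelerate env command should report similar information.

```python
from importlib import metadata


def installed_versions(packages=("accelerate", "torch", "transformers")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this with the same Python interpreter that accelerate launch uses (here /home/anaconda3/envs/lab/bin/python, per the log) gives the versions to include in a bug report.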

thedaffodil · August 14, 2023, 5:38pm

Were you able to find a solution for this error?
