python: 3.9.7
accelerate: 0.12.0
I wanted to double-check whether accelerate supports complex (i.e. real + imaginary) data types like ComplexFloat. My guess is that it does not, since I get the following error, but I wanted to verify that here:
Traceback (most recent call last):
File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 523, in <module>
run_training_pipeline(tmp_dict)
File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 133, in run_training_pipeline
train_dataloader, eval_dataloader, rffp_model, optimizer, lr_scheduler, train_progress_bar = rffp_run.prepare_for_training(
File "/home/aclifton/rf_fp/rffprun.py", line 243, in prepare_for_training
prepared_objs = self.accelerator.prepare(*args)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 621, in prepare
result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 621, in <genexpr>
result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 518, in _prepare_one
return self.prepare_model(obj)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 645, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 648, in __init__
_sync_module_states(
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/utils.py", line 113, in _sync_module_states
_sync_params_and_buffers(
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/utils.py", line 131, in _sync_params_and_buffers
dist._broadcast_coalesced(
RuntimeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20221019_162054-339fnsx1
wandb: Find logs at: ./wandb/offline-run-20221019_162054-339fnsx1/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20221019_162054-3vrw0qu9
wandb: Find logs at: ./wandb/offline-run-20221019_162054-3vrw0qu9/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2461645) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_training_w_evaluate.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-10-19_16:21:06
host : silver-surfer.airlab.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2461646)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-19_16:21:06
host : silver-surfer.airlab.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2461645)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 831, in launch_command
multi_gpu_launcher(args)
File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 450, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '2', 'run_training_w_evaluate.py']' returned non-zero exit status 1.
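For context, the RuntimeError is raised while DistributedDataParallel broadcasts the model's parameters and buffers (`_sync_params_and_buffers` in the traceback), so presumably at least one of them has a complex dtype. Below is a minimal sketch of how that could be checked on a model before calling `accelerator.prepare`. The names `find_complex_tensors` and `ToyComplexModel` are hypothetical and only for illustration; they are not part of the actual training script.

```python
import torch
from torch import nn


def find_complex_tensors(model: nn.Module):
    """Return the names of all parameters and buffers with a complex dtype."""
    names = []
    named_tensors = list(model.named_parameters()) + list(model.named_buffers())
    for name, tensor in named_tensors:
        if tensor.is_complex():
            names.append(name)
    return names


class ToyComplexModel(nn.Module):
    """Hypothetical stand-in for the real model: one real layer, one complex
    parameter, and one complex buffer."""

    def __init__(self):
        super().__init__()
        self.real_layer = nn.Linear(4, 4)
        self.complex_weight = nn.Parameter(torch.randn(4, 4, dtype=torch.cfloat))
        self.register_buffer("complex_buf", torch.zeros(2, dtype=torch.cfloat))


complex_names = find_complex_tensors(ToyComplexModel())
print(complex_names)
```

Any name reported by this check would be a tensor that the NCCL broadcast has to handle, which appears to be where the ComplexFloat error comes from.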
Thanks in advance for your help!