ComplexFloat support in accelerate

aclifton314 · October 19, 2022, 10:25pm

python: 3.9.7
accelerate: 0.12.0

I wanted to double-check whether accelerate supports complex (i.e. real + imaginary) data types like ComplexFloat. My guess is that it doesn’t, since I get the following error, but I wanted to verify that here:

Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 523, in <module>
    run_training_pipeline(tmp_dict)
  File "/home/aclifton/rf_fp/run_training_w_evaluate.py", line 133, in run_training_pipeline
    train_dataloader, eval_dataloader, rffp_model, optimizer, lr_scheduler, train_progress_bar = rffp_run.prepare_for_training(
  File "/home/aclifton/rf_fp/rffprun.py", line 243, in prepare_for_training
    prepared_objs = self.accelerator.prepare(*args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 621, in prepare
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 621, in <genexpr>
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 518, in _prepare_one
    return self.prepare_model(obj)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/accelerator.py", line 645, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 648, in __init__
    _sync_module_states(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/utils.py", line 113, in _sync_module_states
    _sync_params_and_buffers(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/utils.py", line 131, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
wandb: Waiting for W&B process to finish... (failed 1).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20221019_162054-339fnsx1
wandb: Find logs at: ./wandb/offline-run-20221019_162054-339fnsx1/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20221019_162054-3vrw0qu9
wandb: Find logs at: ./wandb/offline-run-20221019_162054-3vrw0qu9/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2461645) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_training_w_evaluate.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-10-19_16:21:06
  host      : silver-surfer.airlab.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2461646)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-10-19_16:21:06
  host      : silver-surfer.airlab.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2461645)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 831, in launch_command
    multi_gpu_launcher(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 450, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '2', 'run_training_w_evaluate.py']' returned non-zero exit status 1.

Thanks in advance for your help!

muellerzr · October 19, 2022, 11:16pm

This is an active bug in PyTorch: NCCL Backend does not support ComplexFloat data type · Issue #71613 · pytorch/pytorch · GitHub
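
The traceback fails while DistributedDataParallel broadcasts the model's parameters and buffers, so it can help to know which of them are complex before calling accelerator.prepare. A minimal sketch of such a check follows; find_complex_tensors and ToyComplexModel are hypothetical names made up for illustration and are not part of accelerate or PyTorch.

```python
import torch
from torch import nn


def find_complex_tensors(model: nn.Module) -> list[str]:
    """Return the names of parameters and buffers whose dtype is complex."""
    names = []
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if tensor.is_complex():
            names.append(name)
    return names


class ToyComplexModel(nn.Module):
    """Hypothetical stand-in for a model with one complex-valued parameter."""

    def __init__(self):
        super().__init__()
        self.real_layer = nn.Linear(4, 4)
        self.complex_weight = nn.Parameter(torch.randn(4, 4, dtype=torch.cfloat))


print(find_complex_tensors(ToyComplexModel()))
```

Any name this reports is a tensor the NCCL broadcast would likely reject with the ComplexFloat error shown above.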

aclifton314 · October 20, 2022, 4:10pm

@muellerzr Thanks! I’ll keep an eye on it!
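
While the issue is open, one possible workaround is to keep complex weights in real-valued storage so that only float tensors go through NCCL. The sketch below assumes the complex values live in your own module's parameters; ComplexLinear is a hypothetical layer, and this has not been verified against accelerate 0.12.0 or the setup in this thread. It relies on torch.view_as_real and torch.view_as_complex, which reinterpret a complex tensor as a real tensor with a trailing dimension of size 2 and back.

```python
import torch
from torch import nn


class ComplexLinear(nn.Module):
    """Hypothetical complex linear layer whose weight is stored as real pairs."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        init = torch.randn(out_features, in_features, dtype=torch.cfloat)
        # Stored shape is (out, in, 2) with dtype float32; the last
        # dimension holds the (real, imaginary) parts of each weight.
        self.weight_real = nn.Parameter(torch.view_as_real(init).clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reinterpret the real storage as complex only for the computation.
        weight = torch.view_as_complex(self.weight_real)
        return x @ weight.T


layer = ComplexLinear(4, 3)
out = layer(torch.randn(2, 4, dtype=torch.cfloat))
print(out.dtype, [p.dtype for p in layer.parameters()])
```

The parameter the distributed wrapper sees is float32, while the forward pass still computes in complex arithmetic. Complex buffers would need the same treatment.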
