Feature Request: Add DDP Communication Hooks

Motivation

I would like to request the addition of DDP communication hooks to the accelerate library. These hooks can improve distributed-training performance by giving users control over how gradients are communicated across workers. Frameworks like PyTorch Lightning and Detectron2 already use them to reduce communication overhead and speed up training. Adding this capability to accelerate would give its users the same performance benefits.

Feature Description

Introduce support for DDP communication hooks such as PowerSGD, FP16, and BF16 in the accelerate library, so users can select a hook to compress gradient communication during distributed training.
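
For context, this is roughly what such a hook looks like when registered directly on a raw torch DistributedDataParallel model, using PyTorch's built-in torch.distributed.algorithms.ddp_comm_hooks (a minimal sketch; the toy model and the assumption that the process group is already initialized are illustrative):

import torch
from torch.nn.parallel import DistributedDataParallel
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes torch.distributed is already initialized (e.g. via torchrun).
ddp_model = DistributedDataParallel(torch.nn.Linear(10, 10))

# Option 1: FP16 (or BF16) compression -- gradients are cast down before all-reduce.
ddp_model.register_comm_hook(state=None, hook=default.fp16_compress_hook)

# Option 2: PowerSGD -- low-rank gradient compression, which needs a state object.
# register_comm_hook can only be called once per DDP instance, so pick one hook.
# state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
# ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)

The proposal is essentially to let accelerate perform this registration for the user.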

Example Code Snippet

Here is an example of how this feature could be used in accelerate:

import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator, DDPCommunicationHookType, DistributedDataParallelKwargs

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 10)

    def forward(self, x):
        return self.layer(x)

# Select the FP16 compression hook through the proposed comm_hook option
ddp_kwargs = DistributedDataParallelKwargs(
    comm_hook=DDPCommunicationHookType.FP16,
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
data_loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 10)), batch_size=8)

model, optimizer, data_loader = accelerator.prepare(model, optimizer, data_loader)

# Training loop
for inputs, targets in data_loader:
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
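
Until such an option exists, the effect can already be approximated by registering a hook directly on the DDP-wrapped module that prepare() returns in a multi-process launch (a sketch continuing the example above, not part of the proposed API):

from torch.nn.parallel import DistributedDataParallel
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default

# In a multi-GPU run, accelerator.prepare() wraps the model in DistributedDataParallel,
# so PyTorch's register_comm_hook API is still reachable on the returned object.
if isinstance(model, DistributedDataParallel):
    model.register_comm_hook(state=None, hook=default.fp16_compress_hook)

Having accelerate own this through DistributedDataParallelKwargs would make the option declarative and keep user code free of the manual check.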

For reference, here is how Detectron2 registers a DDP communication hook:

# Imports as in detectron2/engine/defaults.py, where this helper lives:
from detectron2.utils import comm
from torch.nn.parallel import DistributedDataParallel

def create_ddp_model(model, *, fp16_compression=False, **kwargs):
    if comm.get_world_size() == 1:
        return model
    if "device_ids" not in kwargs:
        kwargs["device_ids"] = [comm.get_local_rank()]
    ddp = DistributedDataParallel(model, **kwargs)
    if fp16_compression:
        from torch.distributed.algorithms.ddp_comm_hooks import default as comm_hooks
        ddp.register_comm_hook(state=None, hook=comm_hooks.fp16_compress_hook)
    return ddp
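
Since the motivation mentions PyTorch Lightning, this is roughly how Lightning exposes the same hooks through its DDPStrategy (a short sketch based on Lightning's ddp_comm_hook argument; the trainer settings are illustrative):

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default

# FP16 gradient compression selected at the strategy level.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=DDPStrategy(ddp_comm_hook=default.fp16_compress_hook),
)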

Thank you for considering this feature request. This addition will help enhance distributed training efficiency in the accelerate library.

Can you open this in the accelerate repo please? :hugs: The forums are really just for Q&A that aren't direct bugs, and it's hard for us to keep track of feature requests here.


I have opened a PR for this. Thanks for your response :slight_smile:
