Bug when using gradient accumulation with accelerate

In `GradientState`, `plugin_kwargs` only picks up the kwargs of `gradient_accumulation_plugin` whose values differ from the `GradientAccumulationPlugin` defaults, and those defaults are `num_steps=None, adjust_scheduler=True, sync_with_dataloader=True, sync_each_batch=False`. So when you initialize the state with `GradientState(gradient_accumulation_plugin=GradientAccumulationPlugin(num_steps=2))`, `GradientState.plugin_kwargs` ends up containing only `num_steps=2`. Meanwhile, the `adjust_scheduler` property of `GradientState` falls back to `False`:

@property
def adjust_scheduler(self) -> bool:
    "Returns whether the scheduler should be adjusted"
    return self.plugin_kwargs.get("adjust_scheduler", False)

So when you use gradient accumulation and call `AcceleratedScheduler.step`, `self.gradient_state.adjust_scheduler` is always `False` instead of `True`. This contradicts what is described in https://huggingface.co/docs/accelerate/package_reference/torch_wrappers: "When performing gradient accumulation scheduler lengths should not be changed accordingly, Accelerate will always step the scheduler to account for it." In practice, the scheduler never counts the accumulation steps.
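
For reference, a few lines are enough to see this (a minimal sketch; exact internals may differ across accelerate versions):

from accelerate.state import GradientState
from accelerate.utils import GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(num_steps=2)
print(plugin.to_kwargs())        # only the non-default value survives: {'num_steps': 2}

state = GradientState(gradient_accumulation_plugin=plugin)
print(state.plugin_kwargs)       # {'num_steps': 2}
print(state.adjust_scheduler)    # False, although the documented plugin default is True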


I think it’s a real bug, but there doesn’t seem to be an existing issue. Something like this?


**Title:** `AcceleratedScheduler` doesn’t step during gradient accumulation unless `adjust_scheduler=True` is explicitly passed (docs say it should “always step”)

**Summary**
When using gradient accumulation and setting only `num_steps` (via `Accelerator(gradient_accumulation_steps=...)` or `GradientAccumulationPlugin(num_steps=...)`), the scheduler does **not** advance on accumulation micro-steps. This contradicts the docs which state that Accelerate will **always** step the scheduler to account for accumulation. ([Hugging Face](https://huggingface.co/docs/accelerate/en/package_reference/torch_wrappers "DataLoaders, Optimizers, and Schedulers"))

**What actually happens**

* `GradientState.adjust_scheduler` reads from `plugin_kwargs` and falls back to `False` when the key is missing.
* The `Accelerator.gradient_accumulation_steps` setter only injects `{"num_steps": N}` into `GradientState.plugin_kwargs`. It does **not** inject `adjust_scheduler=True`.
* `AcceleratedScheduler.step()` only increments the underlying scheduler’s `_step_count` during accumulation when `GradientState.adjust_scheduler` is `True`. If it’s `False` or missing, no accumulation-time step is recorded. ([gemfury.com](https://gemfury.com/emaballarin/python%3Aaccelerate/-/content/state.py "state.py · emaballarin / accelerate v1.11.0.dev0 - python ..."))
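
Condensed, the gating looks roughly like the mock below (a paraphrase of the cited code paths, not verbatim library source; the `Fake*` names are stand-ins):

# Mock of the gating described above (not accelerate's actual code)
from dataclasses import dataclass, field

@dataclass
class FakeGradientState:
    sync_gradients: bool = True
    plugin_kwargs: dict = field(default_factory=dict)

    @property
    def adjust_scheduler(self) -> bool:
        return self.plugin_kwargs.get("adjust_scheduler", False)  # current fallback: False

@dataclass
class FakeScheduler:
    _step_count: int = 0
    def step(self) -> None:
        self._step_count += 1

def accelerated_step(state: FakeGradientState, sched: FakeScheduler) -> None:
    if not state.sync_gradients:       # still inside an accumulation micro-step
        if state.adjust_scheduler:     # False when the key was never injected
            sched._step_count += 1     # only the count is kept in sync
        return                         # otherwise the micro-step is silently dropped
    sched.step()                       # real step only once gradients sync

# With plugin_kwargs == {"num_steps": 2}, accumulation micro-steps are dropped:
state, sched = FakeGradientState(sync_gradients=False, plugin_kwargs={"num_steps": 2}), FakeScheduler()
accelerated_step(state, sched)
print(sched._step_count)  # 0: nothing recorded for the micro-step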

**Why this contradicts docs**
Docs promise: “When performing gradient accumulation scheduler lengths should not be changed accordingly, Accelerate will **always step** the scheduler to account for it.” Defaults for `GradientAccumulationPlugin` also document `adjust_scheduler=True`. Current runtime violates this unless the flag is explicitly provided. ([Hugging Face](https://huggingface.co/docs/accelerate/en/package_reference/torch_wrappers "DataLoaders, Optimizers, and Schedulers"))

**Minimal repro**
# deps: accelerate>=1.11.0, torch
# refs:
#   Docs (always step): https://huggingface.co/docs/accelerate/en/package_reference/torch_wrappers
#   Plugin defaults:    https://huggingface.co/docs/accelerate/en/package_reference/utilities
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin
import torch

m = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(m.parameters(), lr=1.0)
sched = torch.optim.lr_scheduler.LinearLR(opt, start_factor=1.0, end_factor=0.5, total_iters=8)

# Only num_steps provided → plugin_kwargs == {"num_steps": 2}
acc = Accelerator(gradient_accumulation_plugin=GradientAccumulationPlugin(num_steps=2))
m, opt, sched = acc.prepare(m, opt, sched)

lrs = []
for _ in range(8):
    with acc.accumulate(m):
        y = m(torch.randn(4, 2)).sum()
        acc.backward(y)
        opt.step(); opt.zero_grad()
        sched.step()  # during accumulation, AcceleratedScheduler.step() skips the wrapped scheduler unless adjust_scheduler is truthy
    lrs.append(opt.param_groups[0]["lr"])
print(lrs)                          # LR only advances on sync steps, not on every micro-step as the docs describe
print(sched.scheduler._step_count)  # never bumped on accumulation micro-steps because adjust_scheduler resolves to False

**Expected behavior**

* Scheduler advances on every accumulation micro-step without requiring users to pass `adjust_scheduler=True`, matching the docs and the documented defaults. ([Hugging Face](https://huggingface.co/docs/accelerate/en/package_reference/torch_wrappers "DataLoaders, Optimizers, and Schedulers"))

**Root cause (code-level)**

* `GradientState.adjust_scheduler` → `self.plugin_kwargs.get("adjust_scheduler", False)` defaults to `False` if the key is absent. ([gemfury.com](https://gemfury.com/emaballarin/python%3Aaccelerate/-/content/state.py "state.py · emaballarin / accelerate v1.11.0.dev0 - python ..."))
* `Accelerator.gradient_accumulation_steps` setter updates only `{"num_steps": N}`. ([gemfury.com](https://gemfury.com/emaballarin/python%3Aaccelerate/accelerate-1.12.0.dev0-py3-none-any.whl/content/accelerator.py "accelerator.py · emaballarin / accelerate v1.12.0.dev0 - python package | Gemfury"))
* `AcceleratedScheduler.step()` gates accumulation-time `_step_count` bump on `gradient_state.adjust_scheduler`. ([gemfury.com](https://gemfury.com/emaballarin/python%3Aaccelerate/accelerate-1.12.0.dev0-py3-none-any.whl/content/scheduler.py "scheduler.py · emaballarin / accelerate v1.12.0.dev0 - python package | Gemfury"))
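
The reason the key is absent in the first place: `GradientAccumulationPlugin.to_kwargs()` (inherited from `KwargsHandler`) keeps only fields whose values differ from the dataclass defaults, so `adjust_scheduler=True`, being the default, never reaches `plugin_kwargs`. A rough sketch of that filtering (paraphrased, not the verbatim source):

# Paraphrased sketch of the default-filtering behaviour (not verbatim library code)
from dataclasses import asdict
from accelerate.utils import GradientAccumulationPlugin

def to_kwargs_sketch(plugin: GradientAccumulationPlugin) -> dict:
    defaults = asdict(GradientAccumulationPlugin())  # an all-default instance
    return {k: v for k, v in asdict(plugin).items() if v != defaults[k]}

print(to_kwargs_sketch(GradientAccumulationPlugin(num_steps=2, adjust_scheduler=True)))
# -> {'num_steps': 2}  (adjust_scheduler equals its default, so it is dropped)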

**Proposed fixes**
Any one of these aligns runtime with docs and keeps explicit overrides working:

1. Make the property default `True`:
# src/accelerate/state.py
- return self.plugin_kwargs.get("adjust_scheduler", False)
+ return self.plugin_kwargs.get("adjust_scheduler", True)
(Respects explicit `adjust_scheduler=False`, fixes implicit cases.) ([gemfury.com](https://gemfury.com/emaballarin/python%3Aaccelerate/-/content/state.py "state.py · emaballarin / accelerate v1.11.0.dev0 - python ..."))

2. Materialize the documented plugin defaults in `GradientState.__init__`, so the keys exist even when `to_kwargs()` filters them out (or no plugin is passed at all):
# src/accelerate/state.py, GradientState.__init__
- self.plugin_kwargs = (gradient_accumulation_plugin.to_kwargs() if gradient_accumulation_plugin is not None else {})
+ self.plugin_kwargs = {"adjust_scheduler": True, "sync_with_dataloader": True, "sync_each_batch": False}
+ if gradient_accumulation_plugin is not None:
+     self.plugin_kwargs.update(gradient_accumulation_plugin.to_kwargs())
Ensures the keys exist and match the documented defaults whether or not a plugin is passed. ([Hugging Face](https://huggingface.co/docs/accelerate/package_reference/utilities "Utility functions and classes"))

3. When users set `Accelerator(gradient_accumulation_steps=N)`, also inject the documented defaults:
# accelerator.py gradient_accumulation_steps.setter
- self.gradient_state.plugin_kwargs.update({"num_steps": gradient_accumulation_steps})
+ self.gradient_state.plugin_kwargs.update({
+   "num_steps": gradient_accumulation_steps,
+   "adjust_scheduler": True,
+   "sync_with_dataloader": True
+ })
Keeps behavior consistent whether users pass a plugin or a bare integer. ([gemfury.com](https://gemfury.com/emaballarin/python%3Aaccelerate/accelerate-1.12.0.dev0-py3-none-any.whl/content/accelerator.py "accelerator.py · emaballarin / accelerate v1.12.0.dev0 - python package | Gemfury"))

**Workaround for users**
Passing the flag explicitly is the documented route:
GradientAccumulationPlugin(num_steps=2, adjust_scheduler=True)
However, because `to_kwargs()` drops values equal to the dataclass defaults (see above), the explicit `adjust_scheduler=True` may still never reach `plugin_kwargs`. A more reliable workaround today is to set the key on the gradient state directly after constructing the `Accelerator`, as sketched below. ([Hugging Face](https://huggingface.co/docs/accelerate/package_reference/utilities "Utility functions and classes"))
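
A sketch of that direct override (hypothetical user code, not an official API; attribute names as observed in current accelerate):

from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

acc = Accelerator(gradient_accumulation_plugin=GradientAccumulationPlugin(num_steps=2, adjust_scheduler=True))
# Force the key to exist in case to_kwargs() filtered it out as a default value
acc.gradient_state.plugin_kwargs["adjust_scheduler"] = True
assert acc.gradient_state.adjust_scheduler  # True, so AcceleratedScheduler will bump _step_count during accumulation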

**Environment**
Reproduced with current stable docs and code paths; behavior visible with `accelerate==1.11.x` and main, as of 2025-11-07 JST. Key code paths cited above. ([Hugging Face](https://huggingface.co/docs/accelerate/en/package_reference/torch_wrappers "DataLoaders, Optimizers, and Schedulers"))

**Related but different**

* Prior reports discuss the opposite symptom (“scheduler always steps under accumulation”). Useful historical context but not this bug. ([github.com](https://github.com/huggingface/accelerate/issues/963 "Scheduler always steps when training with gradient ..."))