Question about bf16 in Transformers

Hi,

I tried to use Trainer.train() to train my own model. I set bf16: bool = True

like this:

@dataclass
class MyTrainingArgs(TrainingArguments):
    # fsdp : str =  "full_shard auto_wrap"  #TODO
    bf16: bool = True
    bf16_full_eval: bool = True
...

half_precision_backend is left at its default, so it is "auto".

But when I was debugging the forward pass, I printed some intermediate outputs at random, and they were fp32 instead of bf16. For example:

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        print(x.dtype())
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

I checked the Trainer source, and I think the AMP context is related to this part:

... ## this code is from transformers/trainer.py

        with cp_context():
            model.train()
            if hasattr(self.optimizer, "train") and callable(self.optimizer.train):
                self.optimizer.train()

            inputs = self._prepare_inputs(inputs)
            if is_sagemaker_mp_enabled():
                loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps)
                return loss_mb.reduce_mean().detach().to(self.args.device)

            with self.compute_loss_context_manager():
                loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)

    def compute_loss_context_manager(self):
        """
        A helper wrapper to group together context managers.
        """
        ctx_stack = contextlib.ExitStack()

        autocast_ctx = self.autocast_smart_context_manager()
        if not isinstance(autocast_ctx, contextlib.nullcontext):
            ctx_stack.enter_context(autocast_ctx)

        return ctx_stack

And compute_loss_context_manager doesn't seem to be related to using CUDA AMP, so based on the printed results I think I didn't enable bf16 successfully. How does Transformers use CUDA AMP?

My Transformers version is 4.56.2, and my GPU supports bf16:

>>> torch.cuda.is_bf16_supported()
True

Thank you !


How does Transformers use CUDA AMP?

Maybe via the Accelerate library.


You're seeing float32 because CUDA AMP is entered by Accelerate, not by compute_loss_context_manager, and AMP is applied per-op. Params stay fp32; many ops run in bf16; some ops stay in fp32 by design. Verify the autocast context inside your forward rather than by inspecting a single tensor. In 4.35+ this is the intended flow. (GitHub)

What Trainer actually does in 4.56.2

  • With TrainingArguments(bf16=True) (or fp16=True), Trainer builds an Accelerator with mixed_precision="bf16" (or "fp16") and runs your model inside accelerator.autocast(...). CUDA autocast is managed by Accelerate, not by Trainer.autocast_smart_context_manager; see the sketch after this list. (Hugging Face)
  • AMP semantics: with fp16, Accelerate also uses a GradScaler; with bf16, no scaler is needed because bf16 has the same exponent range as fp32. (PyTorch AMP docs updated 2025-06-12; the DeepSpeed page also states no scaling for bf16.) (PyTorch Docs)
  • Some ops are kept in fp32 for stability (e.g., LayerNorm, reductions). Seeing occasional fp32 activations is expected under autocast. (Hugging Face)
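
A minimal standalone sketch of that flow, using only the public Accelerate API (Accelerator, accelerator.autocast) and assuming a bf16-capable device; this is an illustration, not the Trainer code itself:
import torch
import torch.nn as nn
from accelerate import Accelerator

# Rough equivalent of what Trainer sets up when bf16=True (assumes a bf16-capable device).
accelerator = Accelerator(mixed_precision="bf16")
print(accelerator.mixed_precision)                 # "bf16"

linear = nn.Linear(8, 8).to(accelerator.device)    # master weights stay float32
x = torch.randn(2, 8, device=accelerator.device)

with accelerator.autocast():                       # Trainer runs your forward/loss inside this context
    y = linear(x)

print("param dtype:     ", linear.weight.dtype)    # torch.float32
print("activation dtype:", y.dtype)                # torch.bfloat16 under bf16 autocast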

Fixes and checks

  1. Print the context, not just a tensor dtype. Also fix the call: x.dtype is a property, not a function.
# refs:
# - PyTorch AMP: https://pytorch.org/docs/stable/amp.html
import torch, torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4*config.n_embd, bias=config.bias)
        self.gelu   = nn.GELU()
        self.c_proj = nn.Linear(4*config.n_embd, config.n_embd, bias=config.bias)
        self.dropout= nn.Dropout(config.dropout)

    def forward(self, x):
        print("autocast:", torch.is_autocast_enabled(),
              "gpu_dtype:", torch.get_autocast_gpu_dtype() if torch.cuda.is_available() else None)
        x = self.c_fc(x)
        print("after c_fc:", x.dtype)   # expect torch.bfloat16 under bf16 autocast on CUDA
        x = self.gelu(x); x = self.c_proj(x); x = self.dropout(x)
        return x

This should show autocast: True and gpu_dtype: torch.bfloat16 during Trainer.train() with bf16=True. (PyTorch Docs)

  2. Keep Accelerate up to date. Trainer is “powered by Accelerate.” Install or upgrade it so the mixed-precision path is active:
pip install -U accelerate

Docs explicitly note the dependency. (Hugging Face)
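
Once Accelerate is installed, you can also check which mode the Trainer's Accelerator actually picked up. A hedged sketch: it assumes a recent Trainer that exposes its Accelerator as trainer.accelerator, and trainer is your already-constructed Trainer instance:
import accelerate
import transformers

print(transformers.__version__, accelerate.__version__)
print(trainer.args.bf16)                    # expect True
print(trainer.accelerator.mixed_precision)  # expect "bf16" (assumes trainer.accelerator is exposed)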

  3. Evaluation dtype is separate. bf16_full_eval=True makes eval run under bf16; otherwise eval is fp32. That can explain fp32 prints during Trainer.evaluate. (Hugging Face)
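
A minimal sketch of the two flags together (all other arguments omitted):
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,            # bf16 autocast during training
    bf16_full_eval=True,  # also run evaluation/prediction in bf16 instead of full fp32
)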

  4. DeepSpeed users: if you pass deepspeed=..., set bf16 in the DS JSON or it won’t engage bf16 AMP:

{ "bf16": { "enabled": "auto" } }

DS docs also clarify no loss scaling with bf16. (Hugging Face)
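
A minimal sketch of wiring that JSON into the Trainer; the file name ds_config.json is a placeholder for your own config containing the block above:
from transformers import TrainingArguments

# With "enabled": "auto" in the JSON, DeepSpeed inherits the value from bf16=True here.
args = TrainingArguments(
    output_dir="out",
    bf16=True,
    deepspeed="ds_config.json",  # placeholder path to the DeepSpeed config shown above
)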

  5. Understand mixed precision vs all-bf16. AMP keeps master weights in fp32 and casts per-op. If you truly want params and activations in bf16, you must cast the model and inputs yourself and skip AMP; see the sketch below. Trade-offs discussed here. (PyTorch Forums)
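
A minimal sketch of the all-bf16 alternative, with a toy model standing in for yours; note there is no autocast and no fp32 master copy here:
import torch
import torch.nn as nn

# All-bf16 (not AMP): cast parameters and inputs yourself; leave bf16=False in TrainingArguments.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16)).to(device, dtype=torch.bfloat16)
x = torch.randn(4, 16, device=device, dtype=torch.bfloat16)

y = model(x)
print(next(model.parameters()).dtype, y.dtype)     # torch.bfloat16 torch.bfloat16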

Why your snippet showed fp32

  • You likely printed outside the autocast block or inspected an op that remains fp32 under AMP. That matches the 4.35+ change where CUDA autocast moved under Accelerate instead of compute_loss_context_manager. Issues #27760 and #28512 document this migration and confusion. Dates: 2023-11-29 and 2024-01-15. (GitHub)
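
To see the per-op behavior in isolation, here is a small sketch (assumes a bf16-capable CUDA GPU): Linear is on the autocast-to-bf16 list, while LayerNorm is kept in fp32:
import torch
import torch.nn as nn

lin = nn.Linear(8, 8, device="cuda")
ln  = nn.LayerNorm(8, device="cuda")
x   = torch.randn(2, 8, device="cuda")

with torch.autocast("cuda", dtype=torch.bfloat16):
    print(lin(x).dtype)   # torch.bfloat16 -> matmul-heavy ops are autocast down
    print(ln(x).dtype)    # torch.float32  -> LayerNorm stays in fp32 for numerical stability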

Minimal checklist

  • bf16=True in TrainingArguments.
  • Accelerate installed and current; let Trainer drive accelerator.autocast. (Hugging Face)
  • Verify at runtime with torch.is_autocast_enabled() and torch.get_autocast_gpu_dtype(). Expect bfloat16. (PyTorch Docs)
  • If using DeepSpeed, add the "bf16" block. (Hugging Face)

Curated references

GitHub issues

  • CUDA AMP handled by Accelerate, not autocast_smart_context_manager (behavior change in 4.35.x). Nov 29, 2023; Jan 15, 2024. (GitHub)

Hugging Face docs

  • Trainer guide, explicitly “powered by Accelerate.” accessed 2025-10-12. (Hugging Face)
  • Accelerate Accelerator: mixed_precision and autocast API. current page. (Hugging Face)
  • DeepSpeed integration: bf16 JSON section and no loss scaling. v4.35.0 page. (Hugging Face)
  • Performance notes on fp16/bf16 and LayerNorm fp32 behavior. v4.17.0 page. (Hugging Face)

PyTorch

  • AMP docs and checks (is_autocast_enabled, get_autocast_gpu_dtype). updated 2025-06-12. (PyTorch Docs)

If you paste your exact TrainingArguments and whether you use DeepSpeed or FSDP, I’ll point to the specific setting that’s masking bf16 in your run.


Thank you, John!

When I run accelerate launch --mixed_precision bf16 my_code.py, it works!
