Question about bf16 in Transformers

Hi,

I tried to use Trainer.train() to train my own model. I set bf16: bool = True

like this:

@dataclass
class MyTrainingArgs(TrainingArguments):
    # fsdp : str =  "full_shard auto_wrap"  #TODO
    bf16: bool = True
    bf16_full_eval: bool = True
...

half_precision_backend is left at its default, so it is "auto".

But when I was debugging the forward pass, I printed some intermediate outputs at random, and they were fp32 instead of bf16. For example:

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        print(x.dtype())
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

I checked the Trainer source, and I think the AMP context is related to this part:

... ## this code is from transformers/trainer.py

        with cp_context():
            model.train()
            if hasattr(self.optimizer, "train") and callable(self.optimizer.train):
                self.optimizer.train()

            inputs = self._prepare_inputs(inputs)
            if is_sagemaker_mp_enabled():
                loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps)
                return loss_mb.reduce_mean().detach().to(self.args.device)

            with self.compute_loss_context_manager():
                loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)

    def compute_loss_context_manager(self):
        """
        A helper wrapper to group together context managers.
        """
        ctx_stack = contextlib.ExitStack()

        autocast_ctx = self.autocast_smart_context_manager()
        if not isinstance(autocast_ctx, contextlib.nullcontext):
            ctx_stack.enter_context(autocast_ctx)

        return ctx_stack

And compute_loss_context_manager doesn't seem to be related to using CUDA AMP, so based on the printed results I think I didn't enable bf16 successfully. How does Transformers use CUDA AMP?

My Transformers version is 4.56.2, and my GPU supports bf16:

>>> torch.cuda.is_bf16_supported()
True

Thank you !


How does Transformers use CUDA AMP?

Maybe via the Accelerate library.


You're seeing float32 because CUDA AMP is entered by Accelerate, not by compute_loss_context_manager, and AMP is applied per-op. Params stay fp32; many ops run in bf16; some ops stay in fp32 by design. Verify the autocast context inside your forward rather than by inspecting a single tensor. In 4.35+ this is the intended flow. (GitHub)

What Trainer actually does in 4.56.2

  • With TrainingArguments(bf16=True) (or fp16=True), Trainer builds an Accelerator with mixed_precision="bf16" (or "fp16") and runs your model inside accelerator.autocast(...). CUDA autocast is managed by Accelerate, not by Trainer.autocast_smart_context_manager; see the sketch after this list. (Hugging Face)
  • AMP semantics: with fp16, Accelerate also uses a GradScaler; with bf16, no scaler is needed because bf16 has the same exponent range as fp32. (PyTorch AMP docs updated 2025-06-12; the DeepSpeed page also states no scaling for bf16.) (PyTorch Docs)
  • Some ops are kept in fp32 for stability (e.g., LayerNorm, reductions). Seeing occasional fp32 activations is expected under autocast. (Hugging Face)
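
A minimal standalone sketch of that flow, using only the public Accelerate API (Accelerator, accelerator.autocast) and assuming a bf16-capable device; this is an illustration, not the Trainer code itself:
import torch
import torch.nn as nn
from accelerate import Accelerator

# Rough equivalent of what Trainer sets up when bf16=True (assumes a bf16-capable device).
accelerator = Accelerator(mixed_precision="bf16")
print(accelerator.mixed_precision)                 # "bf16"

linear = nn.Linear(8, 8).to(accelerator.device)    # master weights stay float32
x = torch.randn(2, 8, device=accelerator.device)

with accelerator.autocast():                       # Trainer runs your forward/loss inside this context
    y = linear(x)

print("param dtype:     ", linear.weight.dtype)    # torch.float32
print("activation dtype:", y.dtype)                # torch.bfloat16 under bf16 autocast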

Fixes and checks

  1. Print the context, not just a tensor dtype. Also fix the call: x.dtype is a property, not a function.
# refs:
# - PyTorch AMP: https://pytorch.org/docs/stable/amp.html
import torch, torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4*config.n_embd, bias=config.bias)
        self.gelu   = nn.GELU()
        self.c_proj = nn.Linear(4*config.n_embd, config.n_embd, bias=config.bias)
        self.dropout= nn.Dropout(config.dropout)

    def forward(self, x):
        print("autocast:", torch.is_autocast_enabled(),
              "gpu_dtype:", torch.get_autocast_gpu_dtype() if torch.cuda.is_available() else None)
        x = self.c_fc(x)
        print("after c_fc:", x.dtype)   # expect torch.bfloat16 under bf16 autocast on CUDA
        x = self.gelu(x); x = self.c_proj(x); x = self.dropout(x)
        return x

This should show autocast: True and gpu_dtype: torch.bfloat16 during Trainer.train() with bf16=True. (PyTorch Docs)

  2. Keep Accelerate up to date. Trainer is “powered by Accelerate.” Install or upgrade it so the mixed-precision path is active:
pip install -U accelerate

Docs explicitly note the dependency. (Hugging Face)
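
Once Accelerate is installed, you can also check which mode the Trainer's Accelerator actually picked up. A hedged sketch: it assumes a recent Trainer that exposes its Accelerator as trainer.accelerator, and trainer is your already-constructed Trainer instance:
import accelerate
import transformers

print(transformers.__version__, accelerate.__version__)
print(trainer.args.bf16)                    # expect True
print(trainer.accelerator.mixed_precision)  # expect "bf16" (assumes trainer.accelerator is exposed)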

  3. Evaluation dtype is separate. bf16_full_eval=True makes eval run under bf16; otherwise eval is fp32. That can explain fp32 prints during Trainer.evaluate. (Hugging Face)
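
A minimal sketch of the two flags together (all other arguments omitted):
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,            # bf16 autocast during training
    bf16_full_eval=True,  # also run evaluation/prediction in bf16 instead of full fp32
)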

  4. DeepSpeed users: if you pass deepspeed=..., set bf16 in the DS JSON or it won’t engage bf16 AMP:

{ "bf16": { "enabled": "auto" } }

DS docs also clarify no loss scaling with bf16. (Hugging Face)
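
A minimal sketch of wiring that JSON into the Trainer; the file name ds_config.json is a placeholder for your own config containing the block above:
from transformers import TrainingArguments

# With "enabled": "auto" in the JSON, DeepSpeed inherits the value from bf16=True here.
args = TrainingArguments(
    output_dir="out",
    bf16=True,
    deepspeed="ds_config.json",  # placeholder path to the DeepSpeed config shown above
)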

  5. Understand mixed precision vs all-bf16. AMP keeps master weights in fp32 and casts per-op. If you truly want params and activations in bf16, you must cast the model and inputs yourself and skip AMP; see the sketch below. Trade-offs discussed here. (PyTorch Forums)
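
A minimal sketch of the all-bf16 alternative, with a toy model standing in for yours; note there is no autocast and no fp32 master copy here:
import torch
import torch.nn as nn

# All-bf16 (not AMP): cast parameters and inputs yourself; leave bf16=False in TrainingArguments.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16)).to(device, dtype=torch.bfloat16)
x = torch.randn(4, 16, device=device, dtype=torch.bfloat16)

y = model(x)
print(next(model.parameters()).dtype, y.dtype)     # torch.bfloat16 torch.bfloat16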

Why your snippet showed fp32

  • You likely printed outside the autocast block or inspected an op that remains fp32 under AMP. That matches the 4.35+ change where CUDA autocast moved under Accelerate instead of compute_loss_context_manager. Issues #27760 and #28512 document this migration and confusion. Dates: 2023-11-29 and 2024-01-15. (GitHub)
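
To see the per-op behavior in isolation, here is a small sketch (assumes a bf16-capable CUDA GPU): Linear is on the autocast-to-bf16 list, while LayerNorm is kept in fp32:
import torch
import torch.nn as nn

lin = nn.Linear(8, 8, device="cuda")
ln  = nn.LayerNorm(8, device="cuda")
x   = torch.randn(2, 8, device="cuda")

with torch.autocast("cuda", dtype=torch.bfloat16):
    print(lin(x).dtype)   # torch.bfloat16 -> matmul-heavy ops are autocast down
    print(ln(x).dtype)    # torch.float32  -> LayerNorm stays in fp32 for numerical stability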

Minimal checklist

  • bf16=True in TrainingArguments.
  • Accelerate installed and current; let Trainer drive accelerator.autocast. (Hugging Face)
  • Verify at runtime with torch.is_autocast_enabled() and torch.get_autocast_gpu_dtype(). Expect bfloat16. (PyTorch Docs)
  • If using DeepSpeed, add the "bf16" block. (Hugging Face)

Curated references

GitHub issues

  • CUDA AMP handled by Accelerate, not autocast_smart_context_manager (behavior change in 4.35.x). Nov 29, 2023; Jan 15, 2024. (GitHub)

Hugging Face docs

  • Trainer guide, explicitly “powered by Accelerate.” accessed 2025-10-12. (Hugging Face)
  • Accelerate Accelerator: mixed_precision and autocast API. current page. (Hugging Face)
  • DeepSpeed integration: bf16 JSON section and no loss scaling. v4.35.0 page. (Hugging Face)
  • Performance notes on fp16/bf16 and LayerNorm fp32 behavior. v4.17.0 page. (Hugging Face)

PyTorch

  • AMP docs and checks (is_autocast_enabled, get_autocast_gpu_dtype). updated 2025-06-12. (PyTorch Docs)

If you paste your exact TrainingArguments and whether you use DeepSpeed or FSDP, I’ll point to the specific setting that’s masking bf16 in your run.


Thank you, John!

When I run accelerate launch --mixed_precision bf16 my_code.py, it works!
