Trainer with ModernBERT model uses only one GPU

Hello,

I am trying to train a ‘ModernBERT’ model with the Trainer. I have 4 available GPUs.

Yet when I train, I see via nvidia-smi that only one GPU is utilized.

If I change the model to ‘bert’ and don’t change anything else (same environment, etc.), then all 4 GPUs are utilized.

Has anybody else faced this problem? Does anybody know a solution?

Thanks very much for your help.


It seems that one reason is that ModernBERT ends up running under DP (DataParallel).


It’s a known ModernBERT quirk. By default, ModernBERT enables torch.compile on its “reference” path, and if you run Trainer on a multi-GPU box without accelerate/torchrun, it gets launched with PyTorch DataParallel (DP). DP plus ModernBERT’s compiled modules tends to break or silently fall back to a single process, so you see only GPU 0 active. The fix is to run DDP and, if needed, disable the compile path and FlashAttention. (Hugging Face)
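
If you want to confirm that DP is the mode you are hitting, here is a minimal diagnostic sketch (the TrainingArguments values are placeholders, not from this thread):

import torch
from transformers import TrainingArguments
from transformers.training_args import ParallelMode

args = TrainingArguments(output_dir="out")  # placeholder output dir
print("visible GPUs:", torch.cuda.device_count())
print("parallel_mode:", args.parallel_mode)
# NOT_DISTRIBUTED with several visible GPUs means Trainer will wrap the model in
# torch.nn.DataParallel (DP); DISTRIBUTED means one process per GPU (DDP).
print("running under DDP:", args.parallel_mode == ParallelMode.DISTRIBUTED)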

Do this:

  1. Upgrade and load the model safely
  • Use Transformers ≥ 4.48.0. (Hugging Face)
  • Disable the compile path and force a safe attention impl if you hit issues:
from transformers import AutoModelForMaskedLM, AutoTokenizer
mid = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForMaskedLM.from_pretrained(
    mid,
    reference_compile=False,           # avoid FX/compile conflicts
    attn_implementation="eager",       # or "sdpa"; avoid FA2 if unsupported
    torch_dtype="bfloat16",            # if your GPUs support bf16
)

(Hugging Face)
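
For context, here is a minimal train.py sketch that the launch commands in step 2 can point at (the dataset, max_length, and hyperparameters are placeholders, not from this thread):

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

mid = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForMaskedLM.from_pretrained(
    mid, reference_compile=False, attn_implementation="eager", torch_dtype="bfloat16"
)

# Placeholder corpus; swap in your own dataset.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=ds.column_names)

args = TrainingArguments(output_dir="out", per_device_train_batch_size=8, bf16=True)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()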

  2. Launch with true DDP (not DP)
    Pick one of these. Both spawn one process per GPU so all 4 get used.
# Accelerate
accelerate launch --num_processes 4 train.py ...

# torch.distributed
torchrun --standalone --nproc_per_node 4 train.py ...

(Hugging Face)
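
To confirm DDP is actually active, you can drop a quick sanity print near the top of train.py; with 4 GPUs you should see four lines, each with a different LOCAL_RANK (both accelerate launch and torchrun set these environment variables):

import os, torch

# Each DDP worker prints its own rank; under DP you would see only a single line.
print(f"RANK={os.environ.get('RANK')} "
      f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
      f"GPUs visible to this process={torch.cuda.device_count()}")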

  3. If you really want FlashAttention 2, ensure hardware support and that the flash-attn package is installed; otherwise keep attn_implementation="eager" or "sdpa" (see the sketch below). (Hugging Face)
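
A hedged way to opt into FlashAttention 2 only when the flash-attn package is importable, falling back to SDPA otherwise (this checks the package, not the GPU generation, so verify Ampere-or-newer hardware yourself):

import importlib.util
from transformers import AutoModelForMaskedLM

mid = "answerdotai/ModernBERT-base"
# Use FA2 only if flash-attn is installed; otherwise fall back to PyTorch SDPA.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
model = AutoModelForMaskedLM.from_pretrained(
    mid, attn_implementation=attn_impl, reference_compile=False, torch_dtype="bfloat16"
)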

Notes and references

  • The ModernBERT community thread and maintainers recommend reference_compile=False and using accelerate/torchrun instead of DP; the FX-trace error is the common symptom. (Hugging Face)
  • A repo issue documents multi-GPU trouble with the stock example; DP was the trigger and DDP was the workaround. (GitHub)
  • DDP is generally preferred over DP for multi-GPU training. (Sbert)

If this still shows only one GPU, check that you aren’t masking devices via CUDA_VISIBLE_DEVICES, that your dataset isn’t tiny, and that your effective batch size (per_device_train_batch_size × world size) isn’t so small that some ranks end up with no work.
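
For the first of those checks, a quick pre-launch snippet (nothing here is ModernBERT-specific):

import os, torch

# Should print None (or a list covering all 4 GPUs) and then 4.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("GPUs visible to PyTorch:", torch.cuda.device_count())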

Thanks, I will do that.
