Trainer with ModernBERT model uses only one GPU

Hello,

I am trying to train a ‘ModernBERT’ model with the Trainer. I have 4 available GPUs.

Yet when I train, I see via nvidia-smi that only one GPU is utilized.

If I change the model to ‘bert’ and don’t change anything else (same environment, etc.), then all 4 GPUs are utilized.

Has anybody else faced this problem? Does anybody know a solution?

Thanks very much for your help.


It seems that one reason is that ModernBERT ends up running under DP (DataParallel).


It’s a known ModernBERT quirk. By default, ModernBERT enables torch.compile on its “reference” path, and if you run Trainer on a multi-GPU box without accelerate/torchrun, it gets launched with PyTorch DataParallel (DP). DP plus ModernBERT’s compiled modules tends to break or silently fall back to a single process, so you see only GPU 0 active. The fix is to run DDP and, if needed, disable the compile path and FlashAttention. (Hugging Face)
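
If you want to confirm that DP is the mode you are hitting, here is a minimal diagnostic sketch (the TrainingArguments values are placeholders, not from this thread):

import torch
from transformers import TrainingArguments
from transformers.training_args import ParallelMode

args = TrainingArguments(output_dir="out")  # placeholder output dir
print("visible GPUs:", torch.cuda.device_count())
print("parallel_mode:", args.parallel_mode)
# NOT_DISTRIBUTED with several visible GPUs means Trainer will wrap the model in
# torch.nn.DataParallel (DP); DISTRIBUTED means one process per GPU (DDP).
print("running under DDP:", args.parallel_mode == ParallelMode.DISTRIBUTED)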

Do this:

  1. Upgrade and load the model safely
  • Use Transformers ≥ 4.48.0. (Hugging Face)
  • Disable the compile path and force a safe attention impl if you hit issues:
from transformers import AutoModelForMaskedLM, AutoTokenizer
mid = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForMaskedLM.from_pretrained(
    mid,
    reference_compile=False,           # avoid FX/compile conflicts
    attn_implementation="eager",       # or "sdpa"; avoid FA2 if unsupported
    torch_dtype="bfloat16",            # if your GPUs support bf16
)

(Hugging Face)
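
For context, here is a minimal train.py sketch that the launch commands in step 2 can point at (the dataset, max_length, and hyperparameters are placeholders, not from this thread):

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

mid = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForMaskedLM.from_pretrained(
    mid, reference_compile=False, attn_implementation="eager", torch_dtype="bfloat16"
)

# Placeholder corpus; swap in your own dataset.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=ds.column_names)

args = TrainingArguments(output_dir="out", per_device_train_batch_size=8, bf16=True)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()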

  2. Launch with true DDP (not DP)
    Pick one of these. Both spawn one process per GPU so all 4 get used.
# Accelerate
accelerate launch --num_processes 4 train.py ...

# torch.distributed
torchrun --standalone --nproc_per_node 4 train.py ...

(Hugging Face)
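
To confirm DDP is actually active, you can drop a quick sanity print near the top of train.py; with 4 GPUs you should see four lines, each with a different LOCAL_RANK (both accelerate launch and torchrun set these environment variables):

import os, torch

# Each DDP worker prints its own rank; under DP you would see only a single line.
print(f"RANK={os.environ.get('RANK')} "
      f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
      f"GPUs visible to this process={torch.cuda.device_count()}")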

  3. If you really want FlashAttention 2, ensure hardware support and that the flash-attn package is installed; otherwise keep attn_implementation="eager" or "sdpa" (see the sketch below). (Hugging Face)
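
A hedged way to opt into FlashAttention 2 only when the flash-attn package is importable, falling back to SDPA otherwise (this checks the package, not the GPU generation, so verify Ampere-or-newer hardware yourself):

import importlib.util
from transformers import AutoModelForMaskedLM

mid = "answerdotai/ModernBERT-base"
# Use FA2 only if flash-attn is installed; otherwise fall back to PyTorch SDPA.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
model = AutoModelForMaskedLM.from_pretrained(
    mid, attn_implementation=attn_impl, reference_compile=False, torch_dtype="bfloat16"
)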

Notes and references

  • The ModernBERT community thread and maintainers recommend reference_compile=False and using accelerate/torchrun instead of DP; the FX-trace error is the common symptom. (Hugging Face)
  • A repo issue documents multi-GPU trouble with the stock example; DP was the trigger and DDP was the workaround. (GitHub)
  • DDP is generally preferred over DP for multi-GPU training. (Sbert)

If this still shows only one GPU, check that you aren’t masking devices via CUDA_VISIBLE_DEVICES, that your dataset isn’t tiny, and that your effective batch size (per_device_train_batch_size × world size) isn’t so small that some ranks end up with no work.
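
For the first of those checks, a quick pre-launch snippet (nothing here is ModernBERT-specific):

import os, torch

# Should print None (or a list covering all 4 GPUs) and then 4.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("GPUs visible to PyTorch:", torch.cuda.device_count())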

Thanks, I will do that.
