It seems one reason is that your run is falling back to PyTorch DataParallel (DP), which ModernBERT handles poorly.
It’s a known ModernBERT quirk. By default ModernBERT enables torch.compile in its “reference” path, and if you run Trainer on a multi-GPU box without accelerate/torchrun, it gets launched with DP. DP plus ModernBERT’s compiled modules tends to either error out or silently run everything on a single GPU, so you see only GPU-0 active. The fix is to run DDP and, if needed, disable the compile path and FlashAttention. (Hugging Face)
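For reference, here is a minimal sketch of the DDP route: the same Trainer script launched with torchrun (or accelerate launch) so each GPU gets its own process. The script name, toy dataset, and hyperparameters below are placeholders I chose for illustration, not your setup.

# Launch with one process per GPU, e.g.
#   torchrun --nproc_per_node=4 train_modernbert.py
# or
#   accelerate launch train_modernbert.py
# Running plain `python train_modernbert.py` on a multi-GPU box makes Trainer fall back to DP.

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

mid = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForMaskedLM.from_pretrained(mid, reference_compile=False)

# Toy corpus just so the script runs end to end; swap in your own tokenized dataset.
texts = ["ModernBERT supports long contexts.", "torchrun starts one process per GPU."]
train_ds = Dataset.from_dict(dict(tok(texts)))

args = TrainingArguments(
    output_dir="modernbert-mlm",
    per_device_train_batch_size=16,    # per GPU; effective batch = this × world size
    bf16=True,                         # if your GPUs support bf16
    ddp_find_unused_parameters=False,  # flip to True if DDP complains about unused params
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tok),  # default 15% MLM masking
)
trainer.train()  # under torchrun/accelerate, Trainer picks up the DDP environment automatically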
If you hit issues, disable the compile path and force a safe attention implementation:
from transformers import AutoModelForMaskedLM, AutoTokenizer

mid = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForMaskedLM.from_pretrained(
    mid,
    reference_compile=False,      # avoid FX/compile conflicts
    attn_implementation="eager",  # or "sdpa"; avoid FA2 if unsupported
    torch_dtype="bfloat16",       # if your GPUs support bf16
)
If you really want FlashAttention 2, make sure your hardware supports it (Ampere-class or newer GPUs) and the flash-attn package is installed; otherwise keep attn_implementation="eager" or "sdpa". (Hugging Face)
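If you want to make that choice programmatically, something along these lines works (my own sketch, not from the ModernBERT docs): use FA2 only when the flash-attn package is importable and the GPU reports compute capability 8.0 (Ampere) or newer, otherwise fall back to SDPA.

import importlib.util
import torch
from transformers import AutoModelForMaskedLM

def pick_attn_implementation() -> str:
    # FlashAttention 2 needs the flash-attn package and an Ampere-or-newer GPU.
    has_fa2 = importlib.util.find_spec("flash_attn") is not None
    ampere_plus = torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (8, 0)
    return "flash_attention_2" if (has_fa2 and ampere_plus) else "sdpa"

model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    reference_compile=False,
    attn_implementation=pick_attn_implementation(),
    torch_dtype=torch.bfloat16,  # drop or change if bf16 is unavailable
)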
Notes and references
The ModernBERT community thread and maintainers recommend reference_compile=False and using accelerate/torchrun instead of DP; the FX-trace error is the common symptom. (Hugging Face)
A repo issue documents multi-GPU trouble with the stock example; DP was the trigger and DDP was the workaround. (GitHub)
DDP is generally preferred over DP for multi-GPU training. (Sbert)
If this still shows only one GPU, check that you aren’t masking devices via CUDA_VISIBLE_DEVICES, that your dataset isn’t so tiny a single rank consumes it, and that your effective batch (per_device_train_batch_size × world_size) isn’t collapsing to a single sample per rank.
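A quick way to verify those points from inside a rank is to print what the environment actually provides:

import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("visible GPUs:", torch.cuda.device_count())
# These are set by torchrun/accelerate; if they are missing, you are not running DDP.
print("WORLD_SIZE:", os.environ.get("WORLD_SIZE", "<unset>"))
print("LOCAL_RANK:", os.environ.get("LOCAL_RANK", "<unset>"))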