Since there seem to be similar cases, it probably isn’t a bug…?
Your loss goes negative because training is numerically unstable after switching to 800/1333 with a tiny batch. The fix is mechanical: (1) lower the learning rate to match the small global batch, (2) enable strong gradient clipping, (3) make padding deterministic across the batch, and (4) keep similar aspect ratios together. Do those first, then re-enable mixed precision.
What changed and why it breaks
- Resolution jump → larger gradients. Resizing to shortest-edge 800 and longest-edge 1333 is standard in Detectron2/DETR-style configs, but it increases effective token count and gradient magnitudes vs your square baseline. (GitHub)
- LR too high for your global batch. Mask2Former reference configs use BASE_LR=1e-4 at IMS_PER_BATCH=16 with AdamW. If your total batch is 2, linear scaling gives ~1.25e-5, not ~4e-4 shown in your logs. Oversized LR + higher resolution → exploding updates. (Hugging Face)
- Padding changed across batches. By default, Mask2FormerImageProcessor pads each image to the largest H×W in the batch and returns a pixel_mask. That means the same sample can see different padded canvases depending on its batchmate, which shifts convolutions and attention masks and can destabilize early training. Use a fixed pad_size. (Hugging Face)
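- Mixed precision makes overflows visible. With fp16, large logits or gradients can overflow to ±Inf and then turn into NaNs, after which the loss prints huge negative numbers. bf16 keeps fp32’s exponent range, so it overflows far less often, but its reduced precision can still hurt. Confirm in fp32, then turn AMP back on. (PyTorch Docs)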
- “Negative loss” itself is a symptom. BCE-style terms are non-negative under correct usage; negative values typically mean overflow or misuse (e.g., wrong argument order, or targets outside [0, 1]). In your segmentation heads the likely cause is overflow. (PyTorch Forums)
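To make the "non-negative under correct usage" point concrete, here is a minimal pure-Python sketch of binary cross-entropy on a single logit, written in the usual numerically stable form. The function name and the target value 255 (a mask that was never rescaled to 0/1) are illustrative assumptions, not something taken from your logs; it is only meant to help rule out the misuse route before blaming overflow.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Stable binary cross-entropy for one logit.

    Equals -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))], rewritten so that
    exp() is only ever called on a non-positive argument.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# Valid targets in [0, 1]: the loss can never drop below zero.
valid = [bce_with_logits(x, t) for x in (-20.0, -3.0, 0.0, 3.0, 20.0) for t in (0.0, 0.5, 1.0)]
print(min(valid) >= 0.0)

# Hypothetical misuse: a mask stored as 0/255 instead of 0/1.
print(bce_with_logits(3.0, 255.0) < 0.0)
```

If your targets are verified to be in [0, 1] and the loss is still negative, the misuse explanation is off the table and overflow is the remaining suspect.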
Immediate stabilization checklist
Apply all four levers together first, then relax one at a time.
- Scale LR to batch size.
- Reference: BASE_LR=1e-4 at global batch 16. For global batch 2: 1e-4 * (2/16) = 1.25e-5. If you want the same “effective” batch, use gradient accumulation to reach 16 before raising LR. Linear scaling is the standard heuristic. (Hugging Face)
- Enable strong gradient clipping.
- Mask2Former configs: full-model L2 clipping, value 0.01, AdamW, weight decay 0.05, AMP enabled. Keep that. (Hugging Face)
- Fix padding size and keep divisibility.
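- Set a fixed canvas so batch composition doesn’t alter tensors. For 800/1333, 832×1344 covers landscape images: 1344 is the next multiple of 32 above 1333, and 832 leaves one stride of headroom over 800, which is already a multiple of 32. If the dataset also contains portrait images, the resized height can reach 1333 as well, so use 1344×1344 instead. Both choices are consistent with the typical size_divisor=32. (Hugging Face)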
- Group by aspect ratio (or use batch=1).
- Standard practice in Detectron2/MMDetection: aspect-ratio grouping reduces extreme padding and variance across the batch. (detectron2.readthedocs.io)
- Sanity check AMP.
- Do a short fp32 run to verify that loss stays positive and decreases. Then re-enable AMP with GradScaler. (PyTorch Docs)
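The LR arithmetic from the checklist can be sketched as two small helpers. The function names are hypothetical; the reference values (1e-4 at batch 16) come from the configs cited above, and linear scaling is a heuristic rather than a guarantee.

```python
def scaled_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    """Linear scaling rule: the learning rate grows in proportion to the batch."""
    return base_lr * global_batch / base_batch


def accumulation_steps(target_batch: int, per_device_batch: int, num_devices: int = 1) -> int:
    """Smallest number of accumulation steps that reaches target_batch."""
    per_step = per_device_batch * num_devices
    return -(-target_batch // per_step)  # ceiling division


# Reference: 1e-4 at global batch 16; this run: batch 2 on one device.
lr = scaled_lr(1e-4, base_batch=16, global_batch=2)
steps = accumulation_steps(target_batch=16, per_device_batch=2)
print(lr, steps)
```

With accumulation bringing the effective batch to 16, the reference 1e-4 becomes the upper target; starting from the scaled-down value and raising it only once training is stable is the cautious order.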
Concrete code patches
Use either HF Trainer knobs or plain PyTorch. Two minimal examples.
A. Hugging Face Trainer
# refs:
# - Base configs (IMS_PER_BATCH=16, BASE_LR=1e-4, clip=0.01, AdamW, AMP):
# https://huggingface.co/spaces/akhaliq/Mask2Former/blob/16aee.../configs/coco/instance-segmentation/Base-COCO-InstanceSegmentation.yaml
# https://huggingface.co/spaces/akhaliq/Mask2Former/blob/ac0cd.../configs/ade20k/instance-segmentation/Base-ADE20K-InstanceSegmentation.yaml
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch ≈ 16
    learning_rate=1.25e-5,           # 1e-4 * (2/16); conservative start, raise toward 1e-4 once stable
    weight_decay=0.05,
    max_grad_norm=0.01,              # full-model clip ~0.01
    lr_scheduler_type="polynomial",  # poly schedule like WarmupPolyLR
    warmup_steps=0,
    fp16=True,                       # switch to False for a short fp32 sanity check
)
# Fix padding inside your collator call; keep size_divisor=32
# refs:
# - pad_size behavior: https://huggingface.co/docs/transformers/en/model_doc/mask2former
# - pad size divisor 32 as a common convention: https://mmdetection.readthedocs.io/en/dev-3.x/user_guides/config.html
result = image_processor(
    [it["image"] for it in batch],
    [it["instance_seg"] for it in batch],
    instance_id_to_semantic_id=[it["inst2class"] for it in batch],
    size={"shortest_edge": 800, "longest_edge": 1333},
    # fixed canvas, multiples of 32; use 1344x1344 if portrait images occur
    pad_size={"height": 832, "width": 1344},
    return_tensors="pt",
)
B. Plain PyTorch training loop
# refs:
# - AMP guidance: https://pytorch.org/docs/stable/amp.html
# - Clip to 0.01 like Mask2Former configs: links above
import torch

# Newer PyTorch releases prefer torch.amp.GradScaler("cuda") and torch.amp.autocast("cuda").
scaler = torch.cuda.amp.GradScaler(enabled=True)
for step, batch in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=True):
        out = model(**batch)
        loss = out.loss
    scaler.scale(loss).backward()
    # Unscale first so the clip threshold applies to the true gradients.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.01)  # keep small
    scaler.step(optimizer)
    scaler.update()
Dataset and loader details that matter
- Resize policy: “ResizeShortestEdge(800, max_size=1333)” is the canonical policy for detection/instance segmentation stacks; it preserves aspect ratio and caps the long side. It is widely used by Detectron2 and DETR-family baselines. (GitHub)
- Aspect-ratio grouping: Enable it or emulate it in your sampler to keep portrait with portrait and landscape with landscape. (detectron2.readthedocs.io)
- Divisibility: Keep heights and widths divisible by 32 for FPN/pixel-decoder strides; most configs enforce pad_size_divisor=32. (mmdetection.readthedocs.io)
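The three loader points above can be sketched in plain Python. This is only an approximation of the shortest-edge resize (libraries differ in rounding), and the two-bucket grouping imitates the idea behind aspect-ratio grouping rather than reproducing any library's sampler. All function names are hypothetical.

```python
import math
from collections import defaultdict


def resized_shape(height, width, shortest_edge=800, longest_edge=1333):
    """Approximate (height, width) after a shortest-edge resize with a long-side cap."""
    scale = shortest_edge / min(height, width)
    if max(height, width) * scale > longest_edge:
        scale = longest_edge / max(height, width)
    return int(height * scale + 0.5), int(width * scale + 0.5)


def round_up(value, divisor=32):
    """Smallest multiple of divisor that is >= value."""
    return math.ceil(value / divisor) * divisor


def group_by_orientation(shapes, batch_size):
    """Yield index batches in which all images share an orientation."""
    buckets = defaultdict(list)
    for index, (height, width) in enumerate(shapes):
        bucket = buckets["landscape" if width >= height else "portrait"]
        bucket.append(index)
        if len(bucket) == batch_size:
            yield list(bucket)
            bucket.clear()
    for bucket in buckets.values():  # leftover partial batches
        if bucket:
            yield list(bucket)


shapes = [(480, 640), (640, 480), (600, 800), (1024, 768)]
print([resized_shape(h, w) for h, w in shapes])
print(round_up(800), round_up(1333))
print(list(group_by_orientation(shapes, batch_size=2)))
```

Feeding the resized shapes through `round_up` shows what a fixed canvas has to cover: 1344 on any side that can hit the 1333 cap, and 800 on a side that is always the short edge.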
Quick diagnostics
- 8-image overfit test: batch=1, fp32, LR=1e-5, fixed pad_size. Expect a strictly positive, decreasing loss. If it passes, your data and labels are fine and the earlier failure was numerical.
- Log padding ratio: take the mean of pixel_mask per batch (1 = real pixel, 0 = padding); if it is often <0.5, you are training on mostly padding. Group by aspect ratio. (Hugging Face)
- Watch grad norms: log the pre-clip total norm (the value clip_grad_norm_ returns); it should be O(1–10). After clipping at 0.01 the applied norm is at most 0.01 by construction. Multi-million pre-clip norms = divergence.
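Both logging diagnostics reduce to a few lines of arithmetic. The sketch below uses plain Python lists so it runs anywhere; the helper names are hypothetical, and the clipping formula is my reading of standard global-norm clipping, not code lifted from a library.

```python
import math


def valid_pixel_fraction(pixel_mask):
    """Share of real (non-padding) pixels in a mask given as rows of 0/1."""
    total = sum(len(row) for row in pixel_mask)
    return sum(sum(row) for row in pixel_mask) / total


def clip_coefficient(per_param_norms, max_norm):
    """Return (pre-clip global L2 norm, factor that norm clipping would apply)."""
    total_norm = math.sqrt(sum(n * n for n in per_param_norms))
    coef = min(1.0, max_norm / (total_norm + 1e-6))
    return total_norm, coef


# Toy mask: left half is image, right half is padding.
mask = [[1, 1, 0, 0], [1, 1, 0, 0]]
print(valid_pixel_fraction(mask))

# Two parameter tensors with gradient norms 3 and 4, clipped at 0.01.
total, coef = clip_coefficient([3.0, 4.0], max_norm=0.01)
print(total, total * coef)
```

With tensors, the first helper corresponds to `pixel_mask.float().mean()`. The second shows why the number worth logging is the pre-clip norm: the post-clip norm is pinned near 0.01 whenever clipping is active, so it cannot reveal divergence.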
Common pitfalls to avoid
- Using high LR at tiny batch. Baselines assume 1e-4 at batch 16; at batch 2 keep ≈1e-5. (Hugging Face)
- Dropping gradient clipping. Keep 0.01 L2 clip like the configs. (Hugging Face)
- Floating-point traps in AMP. Validate in fp32 first, then re-enable AMP. (PyTorch Docs)
- Unstable padding. Default “pad to largest in batch” changes per batch; fix with pad_size. (Hugging Face)
- Misinterpreting negative loss. It signals overflow or misuse, not a valid objective value. (PyTorch Forums)
Recap: the four levers
Lower LR to ~1.25e-5, clip to 0.01, fix pad_size to 832×1344 (1344×1344 if portrait images occur), group by aspect ratio. Then try AMP again. Those four stabilize Mask2Former at 800/1333 with batch 2. (Hugging Face)
Curated resources and similar cases
Baseline configs and what to copy
- COCO/ADE configs with IMS_PER_BATCH=16, BASE_LR=1e-4, ADAMW, weight_decay=0.05, CLIP_VALUE=0.01, AMP.ENABLED=True. Good canonical references. (Hugging Face)
- Issue thread showing these exact solver keys, including clipping and batch size 16. Useful for quick sanity checks. (GitHub)
Processor, padding, divisibility
- Mask2FormerImageProcessor docs: pad_size and default “pad to largest in batch.” Use it to freeze the canvas. (Hugging Face)
- MMDetection config docs: pad_size_divisor=32 rationale. (mmdetection.readthedocs.io)
Batching by aspect ratio
- Detectron2 docs on aspect-ratio grouping in the data loader. (detectron2.readthedocs.io)
Mixed precision and instability
- PyTorch AMP: NaN/Inf guidance and GradScaler. (PyTorch Docs)
Negative BCE loss reports for context
- PyTorch Forums and StackOverflow threads on negative BCEWithLogitsLoss when misused; useful to recognize the symptom. (PyTorch Forums)
800/1333 resize policy references
- Detectron2 issues and DETR benchmarks referencing ResizeShortestEdge(800, max_size=1333). (GitHub)
Drop-in snippet you can use now
# Learning rate and accumulation:
# Base: 1e-4 @ global batch 16 (configs)
# Scale: 1e-4 * (2/16) = 1.25e-5 (linear scaling)
# refs: configs: https://huggingface.co/spaces/akhaliq/Mask2Former/.../Base-COCO-InstanceSegmentation.yaml
# scaling rule: https://arxiv.org/abs/1706.02677
training_args = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1.25e-5,
    weight_decay=0.05,
    max_grad_norm=0.01,  # keep small, per Mask2Former configs
    lr_scheduler_type="polynomial",
    fp16=True,
)
# Collator: freeze padding and keep divisor 32
# refs: pad_size behavior: https://huggingface.co/docs/transformers/en/model_doc/mask2former
# pad_size_divisor: https://mmdetection.readthedocs.io/en/dev-3.x/user_guides/config.html
def collate(batch):
    return image_processor(
        [x["image"] for x in batch],
        [x["instance_seg"] for x in batch],
        instance_id_to_semantic_id=[x["inst2class"] for x in batch],
        size={"shortest_edge": 800, "longest_edge": 1333},
        # landscape canvas; use 1344x1344 if portrait images occur
        pad_size={"height": 832, "width": 1344},
        return_tensors="pt",
    )