Mask2former negative loss when using non-square images

gpldecah · November 9, 2025, 7:12pm

I have been using mask2former with some success. This was with square images and I started with a loss around 250 and finished after 40 epochs (4200 images) with a loss of 70.
I then changed the image_processor to use size: {"shortest_edge": 800, "longest_edge": 1333}.

The training seems to degenerate completely (batch size 2 used)

{'loss': -36547.7063, 'grad_norm': 1745866.125, 'learning_rate': 0.000398530612244898, 'epoch': 0.02}
{'loss': -817441.0, 'grad_norm': 2542171.0, 'learning_rate': 0.0003968979591836735, 'epoch': 0.04}
{'loss': -5644787.2, 'grad_norm': 16560983.0, 'learning_rate': 0.000395265306122449, 'epoch': 0.06}
{'loss': -19530827.2, 'grad_norm': 2828034.25, 'learning_rate': 0.0003936326530612245, 'epoch': 0.08}
{'loss': -56547244.8, 'grad_norm': 34505864.0, 'learning_rate': 0.000392, 'epoch': 0.1}
{'loss': -124187980.8, 'grad_norm': 144603888.0, 'learning_rate': 0.00039036734693877553, 'epoch': 0.12}
{'loss': -207948672.0, 'grad_norm': 33010758.0, 'learning_rate': 0.000388734693877551, 'epoch': 0.14}
{'loss': -484045926.4, 'grad_norm': 241787776.0, 'learning_rate': 0.0003871020408163265, 'epoch': 0.16}
{'loss': -760237568.0, 'grad_norm': 210816480.0, 'learning_rate': 0.00038546938775510205, 'epoch': 0.18}
{'loss': -1112103731.2, 'grad_norm': 747893376.0, 'learning_rate': 0.00038383673469387754, 'epoch': 0.2}
{'loss': -2247609139.2, 'grad_norm': 815875392.0, 'learning_rate': 0.0003822040816326531, 'epoch': 0.22}
{'loss': -2431668633.6, 'grad_norm': 867899072.0, 'learning_rate': 0.0003805714285714286, 'epoch': 0.24}
{'loss': -2353252761.6, 'grad_norm': 1711129472.0, 'learning_rate': 0.0003789387755102041, 'epoch': 0.27}
{'loss': -3690597580.8, 'grad_norm': 1073970304.0, 'learning_rate': 0.0003773061224489796, 'epoch': 0.29}
{'loss': -4846919680.0, 'grad_norm': 193734288.0, 'learning_rate': 0.00037567346938775515, 'epoch': 0.31}
{'loss': -8446751539.2, 'grad_norm': 733409024.0, 'learning_rate': 0.00037404081632653064, 'epoch': 0.33}
{'loss': -12243546112.0, 'grad_norm': 423252320.0, 'learning_rate': 0.00037240816326530613, 'epoch': 0.35}
{'loss': -13329310515.2, 'grad_norm': 2776371968.0, 'learning_rate': 0.0003707755102040817, 'epoch': 0.37}
{'loss': -12851252428.8, 'grad_norm': 4021588736.0, 'learning_rate': 0.00036914285714285716, 'epoch': 0.39}
{'loss': -22274827878.4, 'grad_norm': 5769148416.0, 'learning_rate': 0.0003675102040816327, 'epoch': 0.41}
{'loss': -14297015910.4, 'grad_norm': 1780796800.0, 'learning_rate': 0.0003658775510204082, 'epoch': 0.43}
{'loss': -34128782950.4, 'grad_norm': 3467620864.0, 'learning_rate': 0.0003642448979591837, 'epoch': 0.45}

I am letting Mask2FormerImageProcessor handle the batches in my collate_fn:

result = self.image_processor(
    [item["image"] for item in batch],
    [item["instance_seg"] for item in batch],
    instance_id_to_semantic_id=[item["inst2class"] for item in batch],
    return_tensors="pt",
)

Does anyone have experience using mask2former with non-square image parameters? This seems very odd.

FYI the params of the image processor:

image_processor: Mask2FormerImageProcessor {
  "do_normalize": true,
  "do_reduce_labels": false,
  "do_rescale": true,
  "do_resize": true,
  "ignore_index": 255,
  "image_mean": [
    0.48500001430511475,
    0.4560000002384186,
    0.4059999883174896
  ],
  "image_processor_type": "Mask2FormerImageProcessor",
  "image_std": [
    0.2290000021457672,
    0.2239999920129776,
    0.22499999403953552
  ],
  "num_labels": 150,
  "pad_size": null,
  "resample": 2,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 1333,
    "shortest_edge": 800
  },
  "size_divisor": 32
}

John6666 · November 10, 2025, 12:38am

Since there seem to be similar cases, it probably isn’t a bug…?

Your loss goes negative because training is numerically unstable after switching to 800/1333 with a tiny batch. The fix is mechanical: (1) lower the learning rate to match the small global batch, (2) enable strong gradient clipping, (3) make padding deterministic across the batch, and (4) keep similar aspect ratios together. Do those first, then re-enable mixed precision.

What changed and why it breaks

Resolution jump → larger gradients. Resizing to shortest-edge 800 and longest-edge 1333 is standard in Detectron2/DETR-style configs, but it increases effective token count and gradient magnitudes vs your square baseline. (GitHub)
LR too high for your global batch. Mask2Former reference configs use BASE_LR=1e-4 at IMS_PER_BATCH=16 with AdamW. If your total batch is 2, linear scaling gives ~1.25e-5, not ~4e-4 shown in your logs. Oversized LR + higher resolution → exploding updates. (Hugging Face)
Padding changed across batches. By default, Mask2FormerImageProcessor pads each image to the largest H×W in the batch and returns a pixel_mask. That means the same sample can see different padded canvases depending on its batchmate, which shifts convolutions and attention masks and can destabilize early training. Use a fixed pad_size. (Hugging Face)
Mixed precision can amplify the problem. If you train with fp16/bf16, large logits or gradients can overflow to Inf/NaN and corrupt the reported loss. Confirm the behavior in fp32 first, then turn AMP back on. (PyTorch Docs)
“Negative loss” itself is a symptom. BCE-style terms are non-negative under correct usage; negative values typically mean overflow or misuse (e.g., wrong argument order). In segmentation heads you’re seeing overflow. (PyTorch Forums)
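
To make the "non-negative under correct usage" point concrete, here is a minimal pure-Python sketch of binary cross-entropy on a single logit. It is not Mask2Former's actual loss code, and the 0/255 target is a hypothetical example of misuse, not a diagnosis of the run above. It only shows that targets inside [0, 1] can never produce a negative value, while a target outside that range can.

```python
import math

def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy for a single logit/target pair."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Targets inside [0, 1]: every term of the expression is non-negative.
valid_losses = [
    bce_with_logits(x, y)
    for x in (-20.0, -1.0, 0.0, 1.0, 20.0)
    for y in (0.0, 0.5, 1.0)
]

# Hypothetical misuse: a mask value left in 0/255 encoding instead of 0/1.
misused_loss = bce_with_logits(5.0, 255.0)

print(min(valid_losses) >= 0.0)  # True
print(misused_loss < 0.0)        # True
```

If a short fp32 run still shows negative loss, checking the value range of the mask labels coming out of the collator is a cheap next step.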

Immediate stabilization checklist

Apply all four levers together first, then relax one at a time.

Scale LR to batch size.

Reference: BASE_LR=1e-4 at global batch 16. For global batch 2: 1e-4 * (2/16) = 1.25e-5. If you want the same “effective” batch, use gradient accumulation to reach 16 before raising LR. Linear scaling is the standard heuristic. (Hugging Face)
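
The scaling arithmetic can be written as a small helper. This is a sketch of the linear scaling heuristic only; the function name and its parameters are made up for illustration and are not part of any library.

```python
def scaled_lr(base_lr, base_batch, per_device_batch, num_devices=1, grad_accum_steps=1):
    """Linear scaling heuristic: LR proportional to the global (effective) batch size."""
    global_batch = per_device_batch * num_devices * grad_accum_steps
    return base_lr * global_batch / base_batch

# Batch 2 on one device, no accumulation: 1e-4 * (2/16)
lr_small_batch = scaled_lr(1e-4, 16, per_device_batch=2)

# Batch 2 with 8 accumulation steps: the effective batch is 16 again
lr_accumulated = scaled_lr(1e-4, 16, per_device_batch=2, grad_accum_steps=8)

print(lr_small_batch, lr_accumulated)
```

Starting at the smaller value even when accumulating is the conservative choice; the larger one is only what the heuristic permits once the effective batch matches the reference.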

Enable strong gradient clipping.

Mask2Former configs: full-model L2 clipping, value 0.01, AdamW, weight decay 0.05, AMP enabled. Keep that. (Hugging Face)

Fix padding size and keep divisibility.

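Set a fixed canvas so batch composition does not alter tensor shapes. For landscape inputs at 800/1333, 832×1344 works: both sides are multiples of 32, consistent with the typical size_divisor=32 (800 is itself a multiple of 32, so 800×1344 is the tightest fit). The canvas must cover the largest resized image, so if the dataset also contains portrait images, which can be up to 1333 pixels tall after resizing, use a canvas of at least 1344 on both sides. (Hugging Face)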
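
A small sketch of how a canvas size can be derived from the resize policy. The helper names and the allow_portrait flag are hypothetical, introduced only for this illustration. Note that 800 is already a multiple of 32, so the tightest landscape canvas is 800×1344; 832×1344 is also valid, just with one extra stride of padding in height.

```python
import math

def round_up(value, divisor=32):
    """Smallest multiple of `divisor` that is >= value."""
    return int(math.ceil(value / divisor) * divisor)

def fixed_canvas(shortest_edge=800, longest_edge=1333, divisor=32, allow_portrait=False):
    """Tightest pad canvas for a shortest-edge/longest-edge resize policy.

    With allow_portrait=True the canvas is square, because either side of a
    resized image may then be the long one.
    """
    short_side = round_up(shortest_edge, divisor)
    long_side = round_up(longest_edge, divisor)
    if allow_portrait:
        return {"height": long_side, "width": long_side}
    return {"height": short_side, "width": long_side}

landscape_canvas = fixed_canvas()
mixed_canvas = fixed_canvas(allow_portrait=True)
print(landscape_canvas, mixed_canvas)
```

The square canvas costs memory, which is one more reason to keep orientations separated in batches when the dataset is mixed.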

Group by aspect ratio (or use batch=1).

Standard practice in Detectron2/MMDetection: aspect-ratio grouping reduces extreme padding and variance across the batch. (detectron2.readthedocs.io)
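
If you are not using Detectron2 or MMDetection, the grouping can be emulated by hand. The following is a minimal sketch, not the implementation used by either library: it splits indices into landscape and portrait groups and builds batches within each group. The function name and its arguments are assumptions for illustration.

```python
import random
from collections import defaultdict

def aspect_ratio_batches(sizes, batch_size, shuffle=True, seed=0):
    """Build index batches in which all images share an orientation.

    sizes: list of (width, height) tuples, one per dataset item.
    Returns a list of lists of dataset indices.
    """
    groups = defaultdict(list)
    for idx, (width, height) in enumerate(sizes):
        groups["landscape" if width >= height else "portrait"].append(idx)

    rng = random.Random(seed)
    batches = []
    for indices in groups.values():
        if shuffle:
            rng.shuffle(indices)  # shuffle within the orientation group
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    if shuffle:
        rng.shuffle(batches)  # mix landscape and portrait batches
    return batches

sizes = [(640, 480), (480, 640), (800, 600), (600, 800), (1024, 768)]
batches = aspect_ratio_batches(sizes, batch_size=2)
print(batches)
```

A list like this can be passed as batch_sampler to a PyTorch DataLoader, which accepts any iterable of index lists. A fixed list repeats the same batches every epoch, so in practice you would rebuild it per epoch with a different seed.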

Sanity check AMP.

Do a short fp32 run to verify that loss stays positive and decreases. Then re-enable AMP with GradScaler. (PyTorch Docs)

Concrete code patches

Use either HF Trainer knobs or plain PyTorch. Two minimal examples.

A. Hugging Face Trainer

# refs:
# - Base configs (IMS_PER_BATCH=16, BASE_LR=1e-4, clip=0.01, AdamW, AMP): 
#   https://huggingface.co/spaces/akhaliq/Mask2Former/blob/16aee.../configs/coco/instance-segmentation/Base-COCO-InstanceSegmentation.yaml
#   https://huggingface.co/spaces/akhaliq/Mask2Former/blob/ac0cd.../configs/ade20k/instance-segmentation/Base-ADE20K-InstanceSegmentation.yaml
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 2 * 8 = 16
    learning_rate=1.25e-5,           # conservative start: 1e-4 * (2/16); the reference 1e-4 applies at effective batch 16
    weight_decay=0.05,
    max_grad_norm=0.01,              # full-model clip ~0.01
    lr_scheduler_type="polynomial",  # poly schedule like WarmupPolyLR
    warmup_steps=0,
    fp16=True,                       # switch to False for a short fp32 sanity check
)

# Fix padding inside your collator call; keep size_divisor=32
# refs:
# - pad_size behavior: https://huggingface.co/docs/transformers/en/model_doc/mask2former
# - pad size divisor 32 as a common convention: https://mmdetection.readthedocs.io/en/dev-3.x/user_guides/config.html
result = image_processor(
    [it["image"] for it in batch],
    [it["instance_seg"] for it in batch],
    instance_id_to_semantic_id=[it["inst2class"] for it in batch],
    size={"shortest_edge": 800, "longest_edge": 1333},
    pad_size={"height": 832, "width": 1344},   # fixed canvas, multiples of 32; covers landscape inputs only
    return_tensors="pt",
)

B. Plain PyTorch training loop

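# refs:
# - AMP guidance: https://pytorch.org/docs/stable/amp.html
# - Clip to 0.01 like Mask2Former configs: links above
# Assumes `model`, `optimizer`, and `loader` already exist and batches are on the GPU.
import torch

# For the fp32 sanity check, set enabled=False here and in autocast below.
scaler = torch.cuda.amp.GradScaler(enabled=True)

for step, batch in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=True):   # mixed-precision forward pass
        out = model(**batch)
        loss = out.loss
    scaler.scale(loss).backward()                 # backward on the scaled loss
    scaler.unscale_(optimizer)                    # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.01)  # keep small
    scaler.step(optimizer)                        # skipped if gradients contain Inf/NaN
    scaler.update()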

Dataset and loader details that matter

Resize policy: “ResizeShortestEdge(800, max_size=1333)” is the canonical policy for detection/instance segmentation stacks; it preserves aspect ratio and caps the long side. It is widely used by Detectron2 and DETR-family baselines. (GitHub)
Aspect-ratio grouping: Enable it or emulate it in your sampler to keep portrait with portrait and landscape with landscape. (detectron2.readthedocs.io)
Divisibility: Keep heights and widths divisible by 32 for FPN/pixel-decoder strides; most configs enforce pad_size_divisor=32. (mmdetection.readthedocs.io)

Quick diagnostics

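8-image overfit test: batch=1, fp32, LR=1e-5, fixed pad_size. Expect a strictly positive, decreasing loss. If it passes, your data and labels are fine and the earlier failure was numerical.
Log the valid-pixel fraction: pixel_mask is 1 on real pixels and 0 on padding, so its mean per batch is the share of real image content. If it is often below 0.5, you are training on mostly padding. Group by aspect ratio. (Hugging Face)
Watch grad norms: clip_grad_norm_ returns the total norm measured before clipping, and that is the grad_norm value the Trainer logs, so logged values can exceed the 0.01 clip threshold. Values that stay around O(1–10) are fine; multi-million norms mean divergence.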
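
Two of these checks can be sketched without any framework. The first function computes the same quantity as the mean of pixel_mask, using only image sizes; the second scans Trainer-style log records for the first sign of divergence. Both function names are hypothetical, and the grad_norm_limit default is an arbitrary assumption to tune, not a recommended constant.

```python
import math

def valid_pixel_fraction(resized_sizes, canvas):
    """Share of real (non-padding) pixels in a batch.

    resized_sizes: list of (height, width) after resizing.
    canvas: (height, width) that every image in the batch is padded to.
    """
    canvas_h, canvas_w = canvas
    real_pixels = sum(h * w for h, w in resized_sizes)
    return real_pixels / (len(resized_sizes) * canvas_h * canvas_w)

def first_divergent_step(logs, grad_norm_limit=1e4):
    """Index of the first record with a non-finite or negative loss, or a huge grad norm."""
    for i, record in enumerate(logs):
        loss, grad_norm = record["loss"], record["grad_norm"]
        if not math.isfinite(loss) or loss < 0 or grad_norm > grad_norm_limit:
            return i
    return None

# A landscape and a portrait image padded to the largest H and W in the batch
mixed = valid_pixel_fraction([(800, 1216), (1216, 800)], canvas=(1216, 1216))
# Two landscape images of similar shape
grouped = valid_pixel_fraction([(800, 1216), (800, 1184)], canvas=(800, 1216))

# First two records from the log posted above
logs = [
    {"loss": -36547.7063, "grad_norm": 1745866.125},
    {"loss": -817441.0, "grad_norm": 2542171.0},
]
print(round(mixed, 3), round(grouped, 3), first_divergent_step(logs))
```

The mixed batch wastes roughly a third of the canvas on padding, while the grouped batch is almost entirely real pixels, which is the effect aspect-ratio grouping is meant to achieve.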

Common pitfalls to avoid

Using high LR at tiny batch. Baselines assume 1e-4 at batch 16; at batch 2 keep ≈1e-5. (Hugging Face)
Dropping gradient clipping. Keep 0.01 L2 clip like the configs. (Hugging Face)
Floating-point traps in AMP. Validate in fp32 first, then re-enable AMP. (PyTorch Docs)
Unstable padding. Default “pad to largest in batch” changes per batch; fix with pad_size. (Hugging Face)
Misinterpreting negative loss. It signals overflow or misuse, not a valid objective value. (PyTorch Forums)

Recap: the four levers

Lower LR to ~1.25e-5, clip to 0.01, fix pad_size to 832×1344, group by aspect ratio. Then try AMP again. Those four stabilize Mask2Former at 800/1333 with batch 2. (Hugging Face)

Curated resources and similar cases

Baseline configs and what to copy

COCO/ADE configs with IMS_PER_BATCH=16, BASE_LR=1e-4, ADAMW, weight_decay=0.05, CLIP_VALUE=0.01, AMP.ENABLED=True. Good canonical references. (Hugging Face)
Issue thread showing these exact solver keys, including clipping and batch size 16. Useful for quick sanity checks. (GitHub)

Processor, padding, divisibility

Mask2FormerImageProcessor docs: pad_size and default “pad to largest in batch.” Use it to freeze the canvas. (Hugging Face)
MMDetection config docs: pad_size_divisor=32 rationale. (mmdetection.readthedocs.io)

Batching by aspect ratio

Detectron2: aspect-ratio grouping flags and docs. (detectron2.readthedocs.io)
MMDetection: AspectRatioBatchSampler. (mmdetection.readthedocs.io)

Mixed precision and instability

PyTorch AMP: NaN/Inf guidance and GradScaler. (PyTorch Docs)

Negative BCE loss reports for context

PyTorch Forums and StackOverflow threads on negative BCEWithLogitsLoss when misused; useful to recognize the symptom. (PyTorch Forums)

800/1333 resize policy references

Detectron2 issues and DETR benchmarks referencing ResizeShortestEdge(800, max_size=1333). (GitHub)

Drop-in snippet you can use now

# Learning rate and accumulation:
#   Base: 1e-4 @ global batch 16  (configs)
#   Scale: 1e-4 * (2/16) = 1.25e-5  (linear scaling for batch 2)
#   This is a conservative start: accumulation below brings the effective batch to 16.
# refs: configs: https://huggingface.co/spaces/akhaliq/Mask2Former/.../Base-COCO-InstanceSegmentation.yaml
#       scaling rule: https://arxiv.org/abs/1706.02677
training_args = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1.25e-5,
    weight_decay=0.05,
    max_grad_norm=0.01,     # keep small, per Mask2Former configs
    lr_scheduler_type="polynomial",
    fp16=True,
)

# Collator: freeze padding and keep divisor 32
# refs: pad_size behavior: https://huggingface.co/docs/transformers/en/model_doc/mask2former
#       pad_size_divisor: https://mmdetection.readthedocs.io/en/dev-3.x/user_guides/config.html
def collate(batch):
    return image_processor(
        [x["image"] for x in batch],
        [x["instance_seg"] for x in batch],
        instance_id_to_semantic_id=[x["inst2class"] for x in batch],
        size={"shortest_edge": 800, "longest_edge": 1333},
        pad_size={"height": 832, "width": 1344},   # covers landscape inputs only
        return_tensors="pt",
    )
