OneFormer IDs/Labels for Fine-Tuning

Hello Forums,

TL;DR: I'm trying to fine-tune OneFormer on a custom dataset. Fine-tuning solely off of the provided ground truth and masks works fine, but it classifies the segmented things wrong. Adding the id2label mapping did not work: I get a CUDA "index out of bounds" assertion failure (I could be implementing it wrong).

I was following a method to fine-tune along with the classes, and this is how I tried to do it. I have two separate JSON files, an id2label and a label2id; both contain the classes and IDs present in the new dataset. In my train file, I load the mapping like this:


with open("labels/id2label.json", "r") as f:
   id2label = json.load(f)
id2label = {int(k): v for k, v in id2label.items()}
label2id = {v: k for k, v in id2label.items()}
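
A quick sanity check on the loaded mapping (just a sketch; it assumes the class IDs should be contiguous integers starting at 0):

# Sanity check: the classifier head indexes classes 0..num_labels-1,
# so the keys of id2label should be contiguous and start at 0.
assert sorted(id2label) == list(range(len(id2label))), \
    f"id2label keys are not contiguous from 0: {sorted(id2label)}"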

along with my config:

config = OneFormerConfig.from_pretrained(
    "model link", id2label=id2label, label2id=label2id, is_training=True
)
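
(Sketch of how the config would then be handed to the model; the checkpoint string is elided the same way as above:)

model = OneFormerForUniversalSegmentation.from_pretrained(
    "model link",                  # same checkpoint as the config
    config=config,                 # carries the new id2label/label2id
    ignore_mismatched_sizes=True,  # classifier head is re-shaped for the new classes
)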

Is there something I am doing wrong with OneFormer? I was aiming for something like the Mask2Former fine-tune (Source 1).

Sources:

  1. Fine Tuning Mask2Former on Custom Dataset
  2. Fine-Tune a Semantic Segmentation Model with a Custom Dataset
  3. Nile Rogers Finetune for both Oneformer and Mask2Former

Seems it’s caused by is_training=True with OneFormer…?

I’ve taken a look at the Issues linked, but neither has helped solve my issue. I am wondering how the first Issue was resolved solely by the is_training=True argument; it is already enabled in my version. Here is a snippet of my attempt:

import json

from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

with open("labels/id2label.json", "r") as f:
    id2label = json.load(f)
id2label = {int(k): v for k, v in id2label.items()}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained(
    "shi-labs/oneformer_ade20k_swin_tiny",
    id2label=id2label,
    label2id=label2id,
    num_labels=num_labels,
    is_training=True,
    ignore_mismatched_sizes=True,
)
model.config.use_contrastive_loss = True
# The text branch expects num_text task-token slots:
processor.image_processor.num_text = (
    model.config.num_queries - model.config.text_encoder_n_ctx
)
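
(For context, roughly how a single example flows through this setup; the file names and segmentation map below are placeholders, not my actual data:)

import numpy as np
from PIL import Image

# Placeholder inputs; in the real script these come from the dataloader.
image = Image.open("example.jpg").convert("RGB")
seg_map = np.array(Image.open("example_mask.png"))  # pixel values must be valid class IDs

inputs = processor(
    images=image,
    segmentation_maps=seg_map,
    task_inputs=["semantic"],
    return_tensors="pt",
)
outputs = model(**inputs)
loss = outputs.loss  # combined mask/class (and contrastive) loss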

Without the id2label and label2id arguments it does fine-tune on the dataset, but with them it fails with:

UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16,0,0], thread: [32,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16,0,0], thread: [33,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.


Hmm… It worked… Could it be that the JSON content is in a label format that PyTorch does not support?

from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation
#import json
#with open("labels/id2label.json", "r") as f:
#    id2label = json.load(f)
#id2label = {int(k): v for k, v in id2label.items()}
id2label = {0: "zero", 1: "one"}
label2id = {v: k for k, v in id2label.items()}  # keys are already ints here
num_labels = len(id2label)
processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained(
    "shi-labs/oneformer_ade20k_swin_tiny",
    id2label=id2label,
    label2id=label2id,
    num_labels=num_labels,
    is_training=True,
    ignore_mismatched_sizes=True,
)
model.config.use_contrastive_loss = True
processor.image_processor.num_text = (
    model.config.num_queries - model.config.text_encoder_n_ctx
)
print(model)
print(processor)

I made the id2label.json file something like this:

{
  "Background": 0,
  "Road": 1
}

The format matches the one provided by the Mask2Former segmentation tutorial: https://huggingface.co/datasets/segments/sidewalk-semantic/blob/main/id2label.json

What did you mean by it worked on your end? Did my snippet run on your machine?


What did you mean by it worked on your end? Did my snippet run on your machine?

Yes. Yes.

Oh boy. Can you give me a general idea of what your system is? I’m running an RTX 4090, so I’ve never thought of it being an issue, unless it’s a dependency issue I’m unaware of.


Yeah. My env is Windows (raw), Python 3.9 (raw), GeForce RTX 3060Ti 8GB.

accelerate                1.8.1
bitsandbytes              0.45.1
hf-xet                    1.1.5
huggingface-hub           0.33.0
numpy                     1.23.5
peft                      0.14.0
pydantic                  2.10.6
torch                     2.4.0+cu124
torchaudio                2.4.0+cu124
torchvision               0.19.0+cu124
transformers              4.46.3

What did your dataloader look like when you tested it?


Hmm? What dataloader?

Some weights of OneFormerForUniversalSegmentation were not initialized from the model checkpoint at shi-labs/oneformer_ade20k_swin_tiny and are newly initialized: ['model.text_mapper.prompt_ctx.weight', 'model.text_mapper.text_encoder.ln_final.bias', 'model.text_mapper.text_encoder.ln_final.weight', 'model.text_mapper.text_encoder.positional_embedding', 'model.text_mapper.text_encoder.token_embedding.weight', 'model.text_mapper.text_encoder.transformer.layers.0.layer_norm1.bias', 'model.text_mapper.text_encoder.transformer.layers.0.layer_norm1.weight', 'model.text_mapper.text_encoder.transformer.layers.0.layer_norm2.bias', 'model.text_mapper.text_encoder.transformer.layers.0.layer_norm2.weight', 'model.text_mapper.text_encoder.transformer.layers.0.mlp.fc1.bias', 'model.text_mapper.text_encoder.transformer.layers.0.mlp.fc1.weight', 'model.text_mapper.text_encoder.transformer.layers.0.mlp.fc2.bias', 'model.text_mapper.text_encoder.transformer.layers.0.mlp.fc2.weight', 'model.text_mapper.text_encoder.transformer.layers.0.self_attn.in_proj_bias', 'model.text_mapper.text_encoder.transformer.layers.0.self_attn.in_proj_weight', 'model.text_mapper.text_encoder.transformer.layers.0.self_attn.out_proj.bias', 'model.text_mapper.text_encoder.transformer.layers.0.self_attn.out_proj.weight', 'model.text_mapper.text_encoder.transformer.layers.1.layer_norm1.bias', 'model.text_mapper.text_encoder.transformer.layers.1.layer_norm1.weight', 'model.text_mapper.text_encoder.transformer.layers.1.layer_norm2.bias', 'model.text_mapper.text_encoder.transformer.layers.1.layer_norm2.weight', 'model.text_mapper.text_encoder.transformer.layers.1.mlp.fc1.bias', 'model.text_mapper.text_encoder.transformer.layers.1.mlp.fc1.weight', 'model.text_mapper.text_encoder.transformer.layers.1.mlp.fc2.bias', 'model.text_mapper.text_encoder.transformer.layers.1.mlp.fc2.weight', 'model.text_mapper.text_encoder.transformer.layers.1.self_attn.in_proj_bias', 'model.text_mapper.text_encoder.transformer.layers.1.self_attn.in_proj_weight', 'model.text_mapper.text_encoder.transformer.layers.1.self_attn.out_proj.bias', 'model.text_mapper.text_encoder.transformer.layers.1.self_attn.out_proj.weight', 'model.text_mapper.text_encoder.transformer.layers.2.layer_norm1.bias', 'model.text_mapper.text_encoder.transformer.layers.2.layer_norm1.weight', 'model.text_mapper.text_encoder.transformer.layers.2.layer_norm2.bias', 'model.text_mapper.text_encoder.transformer.layers.2.layer_norm2.weight', 'model.text_mapper.text_encoder.transformer.layers.2.mlp.fc1.bias', 'model.text_mapper.text_encoder.transformer.layers.2.mlp.fc1.weight', 'model.text_mapper.text_encoder.transformer.layers.2.mlp.fc2.bias', 'model.text_mapper.text_encoder.transformer.layers.2.mlp.fc2.weight', 'model.text_mapper.text_encoder.transformer.layers.2.self_attn.in_proj_bias', 'model.text_mapper.text_encoder.transformer.layers.2.self_attn.in_proj_weight', 'model.text_mapper.text_encoder.transformer.layers.2.self_attn.out_proj.bias', 'model.text_mapper.text_encoder.transformer.layers.2.self_attn.out_proj.weight', 'model.text_mapper.text_encoder.transformer.layers.3.layer_norm1.bias', 'model.text_mapper.text_encoder.transformer.layers.3.layer_norm1.weight', 'model.text_mapper.text_encoder.transformer.layers.3.layer_norm2.bias', 'model.text_mapper.text_encoder.transformer.layers.3.layer_norm2.weight', 'model.text_mapper.text_encoder.transformer.layers.3.mlp.fc1.bias', 'model.text_mapper.text_encoder.transformer.layers.3.mlp.fc1.weight', 
'model.text_mapper.text_encoder.transformer.layers.3.mlp.fc2.bias', 'model.text_mapper.text_encoder.transformer.layers.3.mlp.fc2.weight', 'model.text_mapper.text_encoder.transformer.layers.3.self_attn.in_proj_bias', 'model.text_mapper.text_encoder.transformer.layers.3.self_attn.in_proj_weight', 'model.text_mapper.text_encoder.transformer.layers.3.self_attn.out_proj.bias', 'model.text_mapper.text_encoder.transformer.layers.3.self_attn.out_proj.weight', 'model.text_mapper.text_encoder.transformer.layers.4.layer_norm1.bias', 'model.text_mapper.text_encoder.transformer.layers.4.layer_norm1.weight', 'model.text_mapper.text_encoder.transformer.layers.4.layer_norm2.bias', 'model.text_mapper.text_encoder.transformer.layers.4.layer_norm2.weight', 'model.text_mapper.text_encoder.transformer.layers.4.mlp.fc1.bias', 'model.text_mapper.text_encoder.transformer.layers.4.mlp.fc1.weight', 'model.text_mapper.text_encoder.transformer.layers.4.mlp.fc2.bias', 'model.text_mapper.text_encoder.transformer.layers.4.mlp.fc2.weight', 'model.text_mapper.text_encoder.transformer.layers.4.self_attn.in_proj_bias', 'model.text_mapper.text_encoder.transformer.layers.4.self_attn.in_proj_weight', 'model.text_mapper.text_encoder.transformer.layers.4.self_attn.out_proj.bias', 'model.text_mapper.text_encoder.transformer.layers.4.self_attn.out_proj.weight', 'model.text_mapper.text_encoder.transformer.layers.5.layer_norm1.bias', 'model.text_mapper.text_encoder.transformer.layers.5.layer_norm1.weight', 'model.text_mapper.text_encoder.transformer.layers.5.layer_norm2.bias', 'model.text_mapper.text_encoder.transformer.layers.5.layer_norm2.weight', 'model.text_mapper.text_encoder.transformer.layers.5.mlp.fc1.bias', 'model.text_mapper.text_encoder.transformer.layers.5.mlp.fc1.weight', 'model.text_mapper.text_encoder.transformer.layers.5.mlp.fc2.bias', 'model.text_mapper.text_encoder.transformer.layers.5.mlp.fc2.weight', 'model.text_mapper.text_encoder.transformer.layers.5.self_attn.in_proj_bias', 'model.text_mapper.text_encoder.transformer.layers.5.self_attn.in_proj_weight', 'model.text_mapper.text_encoder.transformer.layers.5.self_attn.out_proj.bias', 'model.text_mapper.text_encoder.transformer.layers.5.self_attn.out_proj.weight', 'model.text_mapper.text_projector.layers.0.0.bias', 'model.text_mapper.text_projector.layers.0.0.weight', 'model.text_mapper.text_projector.layers.1.0.bias', 'model.text_mapper.text_projector.layers.1.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of OneFormerForUniversalSegmentation were not initialized from the model checkpoint at shi-labs/oneformer_ade20k_swin_tiny and are newly initialized because the shapes did not match:
- model.transformer_module.decoder.class_embed.weight: found shape torch.Size([151, 256]) in the checkpoint and torch.Size([3, 256]) in the model instantiated
- model.transformer_module.decoder.class_embed.bias: found shape torch.Size([151]) in the checkpoint and torch.Size([3]) in the model instantiated
- criterion.empty_weight: found shape torch.Size([151]) in the checkpoint and torch.Size([3]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
OneFormerForUniversalSegmentation(
  (model): OneFormerModel( ...
OneFormerProcessor:
- image_processor: OneFormerImageProcessor {
  "class_info_file": "ade20k_panoptic.json",
  "do_normalize": true,
  "do_reduce_labels": false, ...

By the way, isn’t the order of id2label reversed?

{
  "Background": 0,
  "Road": 1
}

should be (JSON keys are strings, hence the int(k) conversion after loading):

{
  "0": "Background",
  "1": "Road"
}

I wrote it backwards by accident in the comment; yes, that is how it is supposed to be. By dataloader I mean the dataloader/data processor that feeds the images into the model for the fine-tune. I’m assuming the mismatch might be because of that, since without the labels it was fine, but adding the id2label mapping causes an issue.


I see.

I’m assuming the mismatch might be because of that, since without the labels it was fine, but adding the id2label mapping causes an issue.

If so, there may be some problem with the generated label and ID sets. For example, if the content of id2label is discontinuous (the IDs are not consecutive integers starting at 0), an out-of-bounds problem is likely to occur:

id2label = {0: "zero", 1: "one"} # fine
id2label = {0: "zero", 4: "four"} # sometimes causes issue

I fixed my problem; this was part of the issue. The main problem was in my dataloader file, where the images are processed: I had to remap the IDs there so they match the id2label mapping in my train file.
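
For anyone who hits the same thing, this is roughly the shape of the dataloader-side fix (a sketch with assumed names and paths; my actual file differs):

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class RemappedSegDataset(Dataset):
    """Sketch: remap raw mask IDs to contiguous train IDs before processing."""

    def __init__(self, image_paths, mask_paths, remap, processor):
        self.image_paths = image_paths
        self.mask_paths = mask_paths
        self.remap = remap        # e.g. {0: 0, 4: 1}; must cover every pixel value
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        seg_map = np.array(Image.open(self.mask_paths[idx]))
        # Remap so every pixel holds a contiguous class ID matching id2label.
        seg_map = np.vectorize(self.remap.get)(seg_map)
        inputs = self.processor(
            images=image,
            segmentation_maps=seg_map,
            task_inputs=["semantic"],
            return_tensors="pt",
        )
        # Drop the batch dimension the processor adds; a collate_fn re-batches.
        return {k: v[0] for k, v in inputs.items()}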
