Fine-tune CLIPSeg with (image, mask) dataset

I want to fine-tune CLIPSeg on my own dataset of grooves in geologic images. I have the images and their binary masks. I understand this would involve fine-tuning the decoder, since CLIPSeg uses a frozen CLIP as the encoder. I also know I need to add a textual input (example: an image of the terrain, the binary mask indicating the groove, and a text description of “long grooves”).

I can’t figure out how to format my dataset such that CLIPSeg will take it and train with it.

I think the inputs need to be torch tensors in a dictionary with the keys “input_ids”, “attention_mask” and “position_ids”, or maybe “conditional_pixel_values”?
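For reference, this is about as far as I’ve gotten (a rough sketch based on the docs linked below; the file name is a placeholder, and I still don’t see where the mask is supposed to go):

from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("terrain_example.png").convert("RGB")  # placeholder image path

# The processor handles both modalities in one call: the text becomes
# input_ids / attention_mask and the image becomes pixel_values.
inputs = processor(text=["long grooves"], images=[image],
                   padding="max_length", return_tensors="pt")
print({k: v.shape for k, v in inputs.items()})

outputs = model(**inputs)      # the forward pass works, but no loss without labels
print(outputs.logits.shape)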

CLIPSeg link: https://huggingface.co/docs/transformers/model_doc/clipseg
HuggingFace CLIPSeg model on GitHub: https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/models/clipseg/modeling_clipseg.py#L1333

(The usual guides for fine-tuning a pre-trained huggingface model don’t seem to apply since CLIPSeg takes in two images and text).

Any help appreciated.


@nielsr if you have any suggestions I’d appreciate it!

I am trying to fine-tune it as well, so this is just some information, not a tutorial.
According to the code, only the logits matter:

loss = None
if labels is not None:
    loss_fn = nn.BCEWithLogitsLoss()
    loss = loss_fn(logits, labels)

My mask is (352, 252) with values from 0 to 255 (I set the background to 255, and since I actually only have one class I set texts = ["bababa"]). I just add my mask as “labels” to the output of the processor, because the docstring says:

        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
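Concretely, what I am doing looks roughly like this (just a sketch; the paths are placeholders and "bababa" is my dummy prompt):

import numpy as np
import torch
from PIL import Image
from transformers import CLIPSegProcessor

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("sample_image.png").convert("RGB")  # placeholder path
mask = np.array(Image.open("sample_mask.png"))         # values 0..255, background = 255

texts = ["bababa"]  # single dummy prompt, since I only have one class
inputs = processor(text=texts, images=[image], padding=True, return_tensors="pt")

# Put the mask under "labels" so that CLIPSegForImageSegmentation.forward
# passes it to BCEWithLogitsLoss together with the logits.
inputs["labels"] = torch.tensor(mask, dtype=torch.float32)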

I actually ran the Trainer successfully, but the result is weird.

And I know something must be wrong, because my loss looks like this:


{'loss': 362.039, 'learning_rate': 9.991666666666666e-05, 'epoch': 0.04}
{'loss': -346.3282, 'learning_rate': 9.983333333333334e-05, 'epoch': 0.08}
{'loss': -1082.6993, 'learning_rate': 9.975000000000001e-05, 'epoch': 0.12}
{'loss': -1596.7958, 'learning_rate': 9.966666666666667e-05, 'epoch': 0.17}
{'loss': -2104.491, 'learning_rate': 9.958333333333335e-05, 'epoch': 0.21}
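My current guess is that BCEWithLogitsLoss expects float targets in [0, 1], while my mask still holds 0 to 255 values, and targets outside that range can push the loss negative. Something like this should map it to 0/1 (untested sketch):

import torch

# Background is 255 in my masks; map it to 0.0 and everything else to 1.0
# so the targets are valid for BCEWithLogitsLoss.
mask = torch.tensor(mask_array)       # mask_array: my raw uint8 mask
inputs["labels"] = (mask != 255).float()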

Overall, the problem is this: the labels I pass in are (352, 352), but inside the forward function they become (len(texts), 352, 352), and I do not know where that change happens.
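To track down where that extra dimension appears, my plan is to print the shapes right before the loss by overriding compute_loss (diagnostic sketch only):

from transformers import Trainer

class DebugTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # "inputs" here is exactly what reaches the model, including the
        # "labels" tensor that ends up in BCEWithLogitsLoss, so print every shape.
        print({k: tuple(v.shape) for k, v in inputs.items() if hasattr(v, "shape")})
        return super().compute_loss(model, inputs, return_outputs=return_outputs)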


I don’t have a solution but am still working on this. How did you pass your training dataset to the Trainer? I’m currently getting this error message…

/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     49                 data = self.dataset.__getitems__(possibly_batched_index)
     50             else:
---> 51                 data = [self.dataset[idx] for idx in possibly_batched_index]
     52         else:
     53             data = self.dataset[possibly_batched_index]

KeyError: 2

…with my data passed like this:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dict,
    eval_dataset=val_dict,
    compute_metrics=compute_metrics,
)
trainer.train()

Where my dataset looks like this:

train_dict = {}
train_dict['images'] = train_images # list of np.arrays
train_dict['texts'] = ["long grooves"]*len(train_images)
train_dict['labels'] = train_masks # list of np.arrays

My images each have the shape (300, 300), as do their binary masks (which are 0’s and 1’s; I had an issue with a different model when they were 0 and 255).
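Since posting this I suspect the KeyError comes from the Trainer’s DataLoader indexing my dict with integer positions (i.e. train_dict[2]), so I’m going to try wrapping everything in an index-based Dataset instead, roughly like this (untested sketch; the CLIPSegProcessor usage and the 352 x 352 output size are assumptions taken from the docs):

import torch
from PIL import Image
from transformers import CLIPSegProcessor

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")

class GrooveDataset(torch.utils.data.Dataset):
    """Wraps parallel lists of images, prompts and masks so the Trainer
    can fetch single examples by integer index."""

    def __init__(self, images, texts, masks):
        self.images = images  # list of (300, 300) uint8 np.arrays
        self.texts = texts    # list of strings, e.g. "long grooves"
        self.masks = masks    # list of (300, 300) binary np.arrays (0/1)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # CLIP expects 3-channel input, so convert the grayscale array first.
        image = Image.fromarray(self.images[idx]).convert("RGB")
        encoding = processor(text=self.texts[idx], images=image,
                             padding="max_length", return_tensors="pt")
        # The processor adds a batch dimension of 1; drop it so the default
        # collator can stack the items itself.
        item = {k: v.squeeze(0) for k, v in encoding.items()}

        mask = torch.tensor(self.masks[idx], dtype=torch.float32)
        # The logits come out at the processed image size (352 x 352 for this
        # checkpoint, as far as I can tell), so resize the mask to match.
        mask = torch.nn.functional.interpolate(
            mask[None, None], size=(352, 352), mode="nearest")[0, 0]
        item["labels"] = mask
        return item

train_dataset = GrooveDataset(train_images,
                              ["long grooves"] * len(train_images),
                              train_masks)

I would then pass train_dataset (and a similarly built eval dataset) to the Trainer instead of the raw dicts.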
Thanks