I want to fine-tune CLIPSeg on my own dataset of grooves in geologic images. I have the images and their binary masks. As I understand it, this means fine-tuning the decoder, since CLIPSeg uses a frozen CLIP as the encoder. I also need to supply a text prompt for each sample (example input: an image of the terrain, a binary mask indicating the groove, and the text description “long grooves”).
I can’t figure out how to format my dataset so that CLIPSeg will accept it for training.
I think the inputs need to be torch tensors in a dictionary with the keys “input_ids”, “attention_mask”, and “position_ids”, or maybe “conditional_pixel_values”?
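For reference, here is a minimal sketch of what I’ve pieced together so far from the model card, just to show what keys I’m getting. I’m assuming the CIDAS/clipseg-rd64-refined checkpoint, and the file name is hypothetical:

```python
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("terrain_001.png").convert("RGB")  # hypothetical file

# The processor tokenizes the prompt and resizes/normalizes the image.
inputs = processor(text=["long grooves"], images=[image],
                   padding=True, return_tensors="pt")
print(inputs.keys())  # input_ids, attention_mask, pixel_values

outputs = model(**inputs)
print(outputs.logits.shape)  # per-pixel logits at the model's 352x352 output resolution
```

That runs for inference, but I don’t see where my binary mask goes. Is it the `labels` argument to `forward()`, which (from the modeling file linked below) seems to compute a `BCEWithLogitsLoss` against the logits?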
CLIPSeg link: https://huggingface.co/docs/transformers/model_doc/clipseg
HuggingFace CLIPSeg model on GitHub: https://github.com/huggingface/transformers/blob/dacd34568d1a27b91f84610eab526640ed8f94e0/src/transformers/models/clipseg/modeling_clipseg.py#L1333
(The usual guides for fine-tuning a pre-trained Hugging Face model don’t seem to apply, since CLIPSeg takes an image plus a prompt, which can be text or a second, conditional image.)
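In case it helps to show where I’m stuck, here is the rough training setup I have in mind. Everything below is a sketch: `GrooveDataset`, the path lists, and the hyperparameters are all made up by me, and I’m assuming the masks should be resized to the model’s 352x352 output resolution and passed as `labels`:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

class GrooveDataset(Dataset):
    """Hypothetical dataset: parallel lists of image paths, mask paths, and prompts."""
    def __init__(self, image_paths, mask_paths, prompts, processor):
        self.image_paths = image_paths
        self.mask_paths = mask_paths
        self.prompts = prompts
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        # Resize the binary mask to CLIPSeg's 352x352 output resolution.
        mask = Image.open(self.mask_paths[idx]).convert("L").resize((352, 352))
        # padding="max_length" keeps input_ids a fixed length so default collation works.
        enc = self.processor(text=self.prompts[idx], images=image,
                             padding="max_length", return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}  # drop the batch dim the processor adds
        item["labels"] = (torch.from_numpy(np.array(mask)) > 0).float()
        return item

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

# Freeze the CLIP encoder; train only the decoder.
for p in model.clip.parameters():
    p.requires_grad = False

# image_paths / mask_paths / prompts are my own lists, e.g. prompts = ["long grooves", ...]
train_ds = GrooveDataset(image_paths, mask_paths, prompts, processor)
loader = DataLoader(train_ds, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.decoder.parameters(), lr=1e-4)  # lr is a guess
model.train()
for epoch in range(5):
    for batch in loader:
        outputs = model(**batch)  # forward() computes BCEWithLogitsLoss when "labels" is given
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Does that look like the right way to format the batches, or am I misusing `labels`?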
Any help appreciated.