Custom VLM - Swapping a vision encoder from a VLM

Hi,

can you provide some support on how to create a custom VLM by changing the vision encoder (and projector/connector module) of an existing VLM? E.g. I would like to replace the SigLIP 1 encoder in PaliGemma 2 with SigLIP 2, or use a SigLIP 2 encoder with the Qwen2.5-VL-3B model. In either case, I want to retain the language model part of the VLM and the LM head.

I would like to know whether this is natively possible with Transformers, or whether I have to fork and patch. Also, what will checkpoint loading look like when I create a custom VLM by replacing the vision encoder and projector module?

Thanks,
Vishal


That’s quite difficult… I’ll leave a link here that might be helpful.

What will checkpoint loading look like when I create a custom VLM by replacing the vision encoder and projector module?

Simply like one of these:

  • model = YourNewModelClass.from_pretrained("your_id/your_checkpoint", trust_remote_code=True)
  • model = YourNewModelClass.from_pretrained("./your_local_checkpoint")
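
For a fully custom model class, the trust_remote_code=True path works because save_pretrained() can export the class's source code alongside the weights. A minimal sketch, assuming hypothetical MyVlmConfig / MyVlmModel classes defined in their own .py module (they are placeholders, not part of Transformers):

import torch.nn as nn
from transformers import AutoModel, PretrainedConfig, PreTrainedModel

# Hypothetical classes for illustration only.
class MyVlmConfig(PretrainedConfig):
    model_type = "my_vlm"

    def __init__(self, hidden_size=2048, **kwargs):
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

class MyVlmModel(PreTrainedModel):
    config_class = MyVlmConfig

    def __init__(self, config):
        super().__init__(config)
        # Stand-in for the real vision encoder + projector + language model.
        self.projector = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states):
        return self.projector(hidden_states)

# Register the classes so save_pretrained() also copies their source files,
# which is what makes trust_remote_code=True loading possible later.
MyVlmConfig.register_for_auto_class()
MyVlmModel.register_for_auto_class("AutoModel")

model = MyVlmModel(MyVlmConfig())
model.save_pretrained("./your_local_checkpoint")

# Reload through the Auto API with custom code enabled.
reloaded = AutoModel.from_pretrained("./your_local_checkpoint", trust_remote_code=True)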

References


To create a custom vision-language model (VLM) by replacing the vision encoder and projector module while retaining the language model and LM head, you can leverage the Hugging Face Transformers library, which provides the classes and methods needed for this kind of modification. Below is a guide on how to approach it:


Approach

The idea is to create a new VLM by combining a different vision encoder (e.g., SigLIP 2) with an existing language model (e.g., the language model from Qwen2.5-VL-3B). You can achieve this without forking the Transformers repository by using the provided classes and methods to initialize and combine the models.


Process

  1. Load the Vision Encoder and Language Model

    • First, load the new vision encoder (e.g., SigLIP 2) and the language model (e.g., the language model from Qwen2.5-VL-3B) separately (see the sketch after this list).
    • Check whether the encoder’s hidden size matches the language model’s; if not, you will need a projection module to align the image features with the language model’s input dimensions.
  2. Modify the Vision Encoder

    • Adapt the new vision encoder so it can stand in for the one it replaces: it should accept pixel_values and return a sequence of image features.
    • Make sure the shape of those features, in particular the hidden size, is compatible with what the language model expects.
  3. Combine the Vision Encoder and Language Model

    • Use the VisionEncoderDecoderModel class (or a similar wrapper) from the Transformers library to combine the vision encoder with the language model.
    • This class takes a pretrained vision encoder and an autoregressive language decoder and routes the image features to the decoder through cross-attention.
  4. Save the Custom VLM

    • After combining the vision encoder and language model, save the custom VLM for future use or fine-tuning.
  5. Load the Custom VLM

    • When you want to resume training or inference, load the custom VLM from the saved checkpoint.
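
As a concrete starting point for steps 1–2, the sketch below loads a vision encoder and a language model separately and compares their hidden sizes to decide whether a projection layer is needed (the checkpoint names are placeholders):

from transformers import AutoModelForCausalLM, SiglipVisionModel

# Placeholder checkpoint names: substitute the encoder and language model you want.
vision_encoder = SiglipVisionModel.from_pretrained("your_vision_encoder_path")
language_model = AutoModelForCausalLM.from_pretrained("your_language_model_path")

vision_dim = vision_encoder.config.hidden_size   # e.g. 768 for a base-size SigLIP encoder
text_dim = language_model.config.hidden_size     # e.g. 2048 for a ~3B language model

# If the dimensions differ, a projector (at minimum a single Linear layer) is needed
# to map the image features into the language model's embedding space.
print(f"vision: {vision_dim}, text: {text_dim}, projector needed: {vision_dim != text_dim}")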

Code Examples

Example 1: Replacing the Vision Encoder

The following example demonstrates how to pair a different pretrained vision encoder with a language model using VisionEncoderDecoderModel (the checkpoint names are placeholders):

from transformers import AutoModel, AutoModelForCausalLM, VisionEncoderDecoderModel

# Load the replacement vision encoder. It must be a PreTrainedModel that accepts
# pixel_values and returns last_hidden_state (a ViT-style image encoder, for example).
vision_encoder = AutoModel.from_pretrained("your_vision_encoder_path")

# Load the language model as the decoder. VisionEncoderDecoderModel feeds the image
# features to the decoder through cross-attention, so the decoder architecture must
# support cross-attention layers.
language_model = AutoModelForCausalLM.from_pretrained(
    "your_language_model_path",
    is_decoder=True,
    add_cross_attention=True,
)

# Initialize the custom VLM from the two pretrained parts
custom_vlm = VisionEncoderDecoderModel(
    encoder=vision_encoder,
    decoder=language_model,
)

# Equivalently, the documented helper loads and wires both models in one call:
# custom_vlm = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
#     "your_vision_encoder_path", "your_language_model_path"
# )

Note that not every decoder-only language model in Transformers implements cross-attention; GPT-2-style decoders do, while many recent architectures do not, so check the decoder you pick before committing to this wrapper.

Example 2: Using a Custom Vision Encoder with a Projector Module

The following example sketches a vision encoder wrapped with a projector module that maps image features into the language model’s hidden size (the checkpoint names are placeholders):

from transformers import AutoModelForCausalLM, SiglipVisionModel
import torch.nn as nn

# Wrap a pretrained vision encoder with a projector that maps its features
# into the language model's embedding space.
class VisionEncoderWithProjector(nn.Module):
    def __init__(self, vision_encoder, projection_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        hidden_size = vision_encoder.config.hidden_size
        # A simple two-layer MLP projector; a single Linear layer also works.
        self.projector = nn.Sequential(
            nn.Linear(hidden_size, projection_dim),
            nn.GELU(),
            nn.Linear(projection_dim, projection_dim),
        )

    def forward(self, pixel_values):
        # (batch, num_patches, hidden_size) -> (batch, num_patches, projection_dim)
        image_features = self.vision_encoder(pixel_values=pixel_values).last_hidden_state
        return self.projector(image_features)

# Load the vision encoder and the language model
vision_encoder = SiglipVisionModel.from_pretrained("your_vision_encoder_path")
language_model = AutoModelForCausalLM.from_pretrained("your_language_model_path")

# Project image features to the language model's hidden size
encoder_with_projector = VisionEncoderWithProjector(
    vision_encoder,
    projection_dim=language_model.config.hidden_size,
)

When the encoder and decoder hidden sizes differ, VisionEncoderDecoderModel creates an equivalent linear projection (enc_to_dec_proj) automatically; an explicit projector like the one above is mainly useful when you assemble the model yourself, as sketched below.
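
One way to use the projected features, as a hedged sketch: prepend them to the text embeddings and run the language model on the combined sequence (LLaVA-style). The image size and prompt are assumptions; real VLMs insert dedicated image placeholder tokens and build the attention mask accordingly.

import torch
from transformers import AutoTokenizer

# Hypothetical usage of encoder_with_projector and language_model from Example 2.
tokenizer = AutoTokenizer.from_pretrained("your_language_model_path")

pixel_values = torch.randn(1, 3, 224, 224)            # assumes a 224x224 image encoder
image_embeds = encoder_with_projector(pixel_values)   # (1, num_patches, lm_hidden_size)

text_inputs = tokenizer("Describe the image.", return_tensors="pt")
text_embeds = language_model.get_input_embeddings()(text_inputs.input_ids)

# Prepend the image features to the text embeddings and run the language model.
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
outputs = language_model(inputs_embeds=inputs_embeds)
print(outputs.logits.shape)  # (1, num_patches + num_text_tokens, vocab_size)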

Handling Checkpoint Loading

When you create a custom VLM, you can save and load checkpoints as follows:

Saving the Model

# Save the custom VLM
custom_vlm.save_pretrained("path_to_save")

Loading the Model

# Load the custom VLM
loaded_model = VisionEncoderDecoderModel.from_pretrained("path_to_save")
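
Because the combined model is saved as a single checkpoint, from_pretrained() restores the encoder, the projection layer (if one was created), and the decoder together. A quick sanity check, assuming the custom_vlm built in the examples above and the loaded_model just created:

import torch

# Verify that one reloaded parameter tensor matches the original.
name, original_param = next(iter(custom_vlm.state_dict().items()))
reloaded_param = loaded_model.state_dict()[name]
assert torch.equal(original_param, reloaded_param)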

Conclusion

By combining the building blocks in the Transformers library, most notably the VisionEncoderDecoderModel class, you can create a custom VLM that replaces the vision encoder and projector module while retaining the language model and LM head. This approach does not require forking the repository; the existing classes and methods, together with trust_remote_code loading for fully custom model classes, provide the necessary support.