How to combine pre-trained weights of components from different multimodal LLMs?

Hi everyone,

I’m currently doing some research on multimodal LLMs, and as you know, an MLLM combines multiple models for vision, text, speech, etc. Most MLLMs have vision/audio encoder(s) to extract features from images, videos, and audio, plus connection modules that adapt those features to the LLM’s embedding space, and usually only the connection modules are trained, with the pre-trained encoders and the LLM kept frozen.

I’m trying to combine some components (e.g. vision encoders) from one MLLM with the architecture of another MLLM, and these MLLMs have their weights stored in safetensors files in Hugging Face repos. So my question is: is there a way to inspect those safetensors files to see which sets of weights correspond to which components of the MLLM? And a possibly more difficult question: can we merge part of the weights from one MLLM’s safetensors (e.g. only the vision encoder’s weights) into the safetensors weights of the other MLLM?

Thanks in advance.


Hi @tcm03,

Your research on multimodal LLMs sounds fascinating! Here are some tips for your questions.

1. Inspecting safetensors Files

Safetensors files store model weights securely and compactly, and their header records each tensor’s name, shape, and dtype, but there is no explicit annotation of which component a given tensor belongs to; the key names are your main clue. To inspect and identify specific components:

  • Use the safetensors Library: The safetensors library allows you to load and manipulate weights programmatically. You can load the file and print the key names, which often correspond to specific components (e.g., vision.encoder.layer.*, text.encoder.layer.*).

    from safetensors.torch import safe_open
    
    file_path = "path/to/model.safetensors"
    with safe_open(file_path, framework="pt") as f:
        # Each key is a full parameter name, e.g. "vision.encoder.layer.0.attention.query.weight"
        for key in f.keys():
            # get_slice() exposes the shape without loading the whole tensor
            print(key, f.get_slice(key).get_shape())
    

    This should give you a map of weight names to their corresponding components.

  • Check Model Documentation: Sometimes the structure of the keys is outlined in the Hugging Face model documentation or associated codebase. If not, examining the configuration files (config.json) in the same repository may help.
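
    For instance, here is a minimal sketch that downloads a repo’s config.json and lists its top-level sections (the repo ID below is a placeholder):

    from huggingface_hub import hf_hub_download
    import json

    # Download config.json from the Hub (replace the repo ID with the real one)
    config_path = hf_hub_download(repo_id="some-org/some-mllm", filename="config.json")
    with open(config_path) as fp:
        config = json.load(fp)

    # Multimodal configs often contain nested sub-configs such as "vision_config" or "text_config"
    print(list(config.keys()))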

2. Combining Weights Across MLLMs

Combining weights from different models can be challenging but feasible if you carefully match architectures:

  • Extract Specific Weights: You can extract the desired weights (e.g., vision encoder) by filtering relevant keys using the safetensors library.

    # Reopen the checkpoint and pull only the vision-encoder tensors into memory
    with safe_open(file_path, framework="pt") as f:
        desired_keys = [key for key in f.keys() if key.startswith("vision.encoder")]
        weights = {key: f.get_tensor(key) for key in desired_keys}
    

    Save these weights into a new safetensors file or load them into another model; see the first sketch after this list for a concrete example.

  • Adapt to Another Model:

    • If the architectures align well, you can directly map weights by ensuring the key names and dimensions match.
    • If there are structural differences (e.g., different layer sizes), you may need to modify the target model’s architecture or interpolate the weights.
  • Transfer Learning Consideration: When integrating parts of two models, freezing the encoders and fine-tuning only the connection modules (as you mentioned) is a reasonable way to adapt the new components; the second sketch after this list shows one way to set this up.

  • Toolkits for Integration: Consider using libraries like Hugging Face Transformers or PEFT for model manipulation. These frameworks make it easier to handle modular architectures and to fine-tune only the parts you change (PEFT also appears in the second sketch below).
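
Putting the extraction and adaptation steps together, here is a minimal sketch. The vision.encoder. and vision_tower. prefixes are placeholders for whatever key names you actually find in the two checkpoints, and target_model stands for the other MLLM instantiated from its own codebase/config:

    from safetensors import safe_open
    from safetensors.torch import save_file

    # 1) Pull only the vision-encoder tensors out of the source checkpoint
    src_path = "path/to/source_mllm.safetensors"
    with safe_open(src_path, framework="pt") as f:
        vision_weights = {
            k: f.get_tensor(k) for k in f.keys() if k.startswith("vision.encoder.")
        }

    # 2) Optionally save them as a standalone safetensors file
    save_file(vision_weights, "vision_encoder_only.safetensors")

    # 3) Rename keys to match the target model's naming scheme
    remapped = {
        k.replace("vision.encoder.", "vision_tower."): v
        for k, v in vision_weights.items()
    }

    # 4) Load into the target model (the other MLLM, built from its own code/config).
    #    strict=False reports missing/unexpected keys instead of raising, which tells
    #    you how well the two architectures line up.
    # missing, unexpected = target_model.load_state_dict(remapped, strict=False)
    # print("Missing keys:", missing)
    # print("Unexpected keys:", unexpected)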

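For the transfer-learning and toolkit points, here is a second minimal sketch that freezes the transplanted encoder and trains only the connection module, with an optional PEFT/LoRA setup for the LLM. Again, target_model, vision_tower, and connector are placeholder names for your combined model:

    # Freeze the transplanted vision encoder so its weights are not updated
    for param in target_model.vision_tower.parameters():
        param.requires_grad = False

    # Optional: add LoRA adapters to the LLM with PEFT while keeping the
    # connection module fully trainable via modules_to_save
    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # typical attention projections in the LLM
        modules_to_save=["connector"],        # keep the connection module trainable
    )
    peft_model = get_peft_model(target_model, lora_config)
    peft_model.print_trainable_parameters()
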
Final Thoughts

Carefully validate the performance of the combined model through fine-tuning and evaluation. Also, be cautious about licensing terms when using weights from different repositories.

Hope this helps, and good luck with your research! Feel free to ask if you need further clarification.

Best regards,
Alan

