Prompt tuning for a multimodal model

I am currently working with a multimodal model, llava-next-video, for video classification, and I would like to try prompt tuning on it. I went through this notebook example for prompt tuning and the documentation, but I did not find any specification of the prompt data format for multimodal prompting. In my case, I use the following chat template:

template = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {
                "type": "text",
                "text": "Please classify the behaviour in the video if it contains punching",
            },
        ],
    }
]

Is there any reference or GitHub repo showing how to use PEFT for prompt tuning with multimodal prompts built from a chat template?

Perhaps this is an example.
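
In case it helps, here is a rough sketch of the two pieces that seem to be needed: the chat-template conversation and a PEFT prompt-tuning config. The helper names `build_conversation` and `make_prompt_tuning_config` are made up for illustration, and the value of 16 virtual tokens is an arbitrary assumption. `PromptTuningConfig`, `PromptTuningInit` and `TaskType` are real PEFT classes, but I have not verified that prompt tuning works unchanged on the llava-next-video wrapper, so treat this as a starting point rather than a known-good recipe.

```python
def build_conversation(behaviour="punching"):
    """Hypothetical helper: one user turn with a video slot and a text instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video"},
                {
                    "type": "text",
                    "text": f"Please classify the behaviour in the video if it contains {behaviour}",
                },
            ],
        }
    ]


def make_prompt_tuning_config(init_text, tokenizer_name, num_virtual_tokens=16):
    """Hypothetical helper: build a PEFT prompt-tuning config.

    peft is imported lazily so build_conversation stays usable without it.
    num_virtual_tokens=16 is an assumed value, not a recommendation.
    """
    from peft import PromptTuningConfig, PromptTuningInit, TaskType

    return PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        prompt_tuning_init=PromptTuningInit.TEXT,  # start the soft prompt from real text
        prompt_tuning_init_text=init_text,
        num_virtual_tokens=num_virtual_tokens,
        tokenizer_name_or_path=tokenizer_name,
    )


if __name__ == "__main__":
    conversation = build_conversation()
    # Show the text instruction that accompanies the video slot.
    print(conversation[0]["content"][1]["text"])
```

The conversation would then go through the processor's `apply_chat_template`, and the config through `peft.get_peft_model` together with the loaded model. One thing to check: the llava-next-video config keeps the language-model sizes under `text_config`, so PEFT may not pick up the hidden size automatically, and `token_dim` (plus the related fields on the config) might have to be set by hand.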