I am currently working with a multimodal model, llava-next-video, for video classification, and I would like to try prompt tuning on it. I went through this notebook example for prompt tuning and the documentation, but I did not find any specification of the prompt data format for multimodal prompting. In my case, I use the following chat template:
template = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {
                "type": "text",
                "text": (
                    "Please classify the behaviour in the video if it contain punching"
                )
            }
        ]
    }
]
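To make the structure concrete, here is a minimal sketch of how a template with the same shape as the one above could be generated for different behaviours. The names `build_conversation` and `text_parts` are hypothetical helpers for illustration only; they are not part of transformers or PEFT, and nothing here performs prompt tuning itself.

```python
from __future__ import annotations

from typing import Any


def build_conversation(behaviour: str) -> list[dict[str, Any]]:
    """Return a single-turn conversation with one video slot and one text prompt.

    Hypothetical helper: mirrors the shape of the template above.
    """
    return [
        {
            "role": "user",
            "content": [
                # Placeholder entry only; the video frames are supplied separately.
                {"type": "video"},
                {
                    "type": "text",
                    "text": (
                        "Please classify the behaviour in the video "
                        f"if it contain {behaviour}"
                    ),
                },
            ],
        }
    ]


def text_parts(conversation: list[dict[str, Any]]) -> list[str]:
    """Collect the text of every "text" entry, skipping video placeholders."""
    return [
        part["text"]
        for turn in conversation
        for part in turn["content"]
        if part["type"] == "text"
    ]


conversation = build_conversation("punching")
print(text_parts(conversation))
```

A list like this is the kind of structure that is usually passed to a processor's `apply_chat_template`. My open question is where the learned virtual tokens from prompt tuning would go relative to the video placeholder and the text entry.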
Is there any reference or GitHub repo that shows how to use PEFT for prompt tuning with multimodal prompts built from a chat template?