Model (LLM) for Video Classification

I’ve collected a dataset of ~1300 short video clips, each between 0 and 50 seconds long, in Persian (a low-resource language) for sentiment and emotion classification tasks.
I’ve also extracted the audio of each video as .mp3 and transcribed its content.
I could not find a paper in this area on sentiment analysis over videos using recent LLMs (or rather, MLLMs). We would prefer to use in-context learning and prompting for this task, since fine-tuning might be too expensive given our hardware limitations.
If you’ve tried something similar, please share it with me. Any suggestions are appreciated.

Thank you!

If you need a local model, the Qwen3-Omni models might be appropriate and good enough without further fine-tuning.
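To make the "no fine-tuning" route a bit more concrete, here is a minimal sketch of the text side of in-context learning: building a few-shot prompt from labeled transcripts and mapping the model's free-form reply back to a label. The label set and the helper names `build_prompt` and `parse_label` are my own assumptions for illustration, not part of any library. The call to the model itself is left out on purpose, since it depends on how you serve it, and the audio or video would be attached alongside this text.

```python
from typing import List, Tuple

# Hypothetical label set; replace with the labels used in your dataset.
LABELS = ["positive", "negative", "neutral"]


def build_prompt(examples: List[Tuple[str, str]], transcript: str) -> str:
    """Assemble a few-shot classification prompt from (transcript, label) pairs."""
    lines = [
        "Classify the sentiment of the speaker in each Persian video clip.",
        "Answer with exactly one word from: " + ", ".join(LABELS) + ".",
        "",
    ]
    # Each labeled example becomes one demonstration in the prompt.
    for text, label in examples:
        lines.append(f"Transcript: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The clip to classify comes last, with the answer left open.
    lines.append(f"Transcript: {transcript}")
    lines.append("Sentiment:")
    return "\n".join(lines)


def parse_label(reply: str) -> str:
    """Return the label that appears earliest in the reply, or 'unknown'."""
    lowered = reply.lower()
    hits = [(lowered.find(label), label) for label in LABELS if label in lowered]
    return min(hits)[1] if hits else "unknown"
```

With ~1300 clips you could hold out a handful of labeled clips per class as demonstrations and evaluate on the rest. Replies that parse to `unknown` are worth inspecting by hand rather than silently dropping.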

If privacy is not a major issue, I would use Gemini 3 Flash via the API. The batch version of the API supposedly doesn’t even use your data for training their models, but I would still only consider it if privacy and licensing are not a concern.

This option is also inexpensive: you will have a fairly long input, which is cheap per token, and a very short output, so the total cost stays low for almost top-of-the-line performance. It also supports both audio and video input.
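As a rough way to sanity-check the "long input, short output" cost argument, here is a small back-of-the-envelope estimator. Every number in the usage example is a placeholder assumption, not an actual price or tokenization rate. Look up the current per-token prices and how many tokens one second of video costs for the model you pick, and plug those in.

```python
def estimate_cost_usd(
    n_videos: int,
    avg_seconds: float,
    tokens_per_second: float,
    prompt_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Estimate the total API cost for classifying a set of short videos."""
    # Input per video: media tokens plus the text prompt (instructions, examples).
    total_input = n_videos * (avg_seconds * tokens_per_second + prompt_tokens)
    # Output per video: just the predicted label, so only a few tokens.
    total_output = n_videos * output_tokens
    return (total_input * usd_per_m_input + total_output * usd_per_m_output) / 1_000_000


# Placeholder values only; none of these are real prices or token rates.
cost = estimate_cost_usd(
    n_videos=1300,
    avg_seconds=25,
    tokens_per_second=300,
    prompt_tokens=500,
    output_tokens=10,
    usd_per_m_input=0.5,
    usd_per_m_output=3.0,
)
print(f"{cost:.2f} USD")
```

Under these placeholder numbers the input side dominates the total, which is why the per-token input price matters much more than the output price for this kind of task.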