Seeking Advice: Training a TikTok Video Quality Assessment Model Inspired by DeepSeek-R1

Background

I’m a content creator who produces short videos. To improve my craft, I regularly study high-quality videos to learn techniques and gain inspiration. The short video industry is extremely competitive, and I’m looking to use AI technology to enhance my creative efficiency and video quality.

My Goals

  1. Quality Assessment: I need an AI model that can identify high-quality TikTok videos and explain what makes them excellent (innovative filming techniques, interesting narrative structures, high visual quality, etc.)

  2. Content Potential Recognition: I want the model to automatically identify high-potential content elements within videos, helping me quickly filter out the most valuable creative materials.

Current Resources

I have watched and manually annotated numerous videos, marking the reasons why certain videos are considered high quality.
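For reference, here is a minimal sketch of how such annotations could be stored as JSONL so they can later be streamed as training data. All field names and values are hypothetical placeholders, not a fixed schema:

```python
import json

# Hypothetical annotation record: field names are illustrative only.
annotation = {
    "video_id": "7301234567890123456",           # made-up TikTok video identifier
    "is_high_quality": True,                      # overall manual judgement
    "reasons": [
        "innovative match-cut transitions",
        "strong emotional hook in the first 3 seconds",
    ],
    "dimensions": {                               # optional per-dimension notes
        "narrative_structure": "three-act arc compressed into 30 s",
        "visual_quality": "stable, well-lit footage",
    },
}

# One JSON object per line (JSONL) so the file can be read incrementally.
with open("annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(annotation, ensure_ascii=False) + "\n")
```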

Proposed Approach

I’m interested in developing a model inspired by DeepSeek-R1’s reasoning capabilities to evaluate TikTok videos. This model would need reflective and reasoning abilities, since video quality standards aren’t strictly quantifiable. It should provide multi-dimensional evaluations covering aspects such as the following (a sketch of one possible rubric appears after this list):

  • Content themes

  • Visual effects

  • Narrative structure

  • Audience engagement techniques

  • Emotional resonance

  • And more
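To make the idea concrete, here is a minimal sketch of a structured rubric the model could be asked to fill in. The dimension names mirror the list above; the 1–5 scoring scale and the dataclass layout are my own assumptions:

```python
from dataclasses import dataclass, field, asdict
from typing import Dict

# Dimensions taken from the list above; the 1-5 scale is an assumption.
DIMENSIONS = [
    "content_theme",
    "visual_effects",
    "narrative_structure",
    "audience_engagement",
    "emotional_resonance",
]

@dataclass
class VideoEvaluation:
    video_id: str
    scores: Dict[str, int] = field(default_factory=dict)      # dimension -> 1-5 score
    rationales: Dict[str, str] = field(default_factory=dict)  # dimension -> reasoning text
    overall_verdict: str = ""                                  # free-form summary

# Example of the structure an R1-style model could be prompted to return as JSON.
example = VideoEvaluation(
    video_id="example_001",
    scores={d: 3 for d in DIMENSIONS},
    rationales={d: "placeholder reasoning" for d in DIMENSIONS},
    overall_verdict="solid but not exceptional",
)
print(asdict(example))
```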

Questions

  1. What would be the most effective architecture for such a model?

  2. How should I structure my training data?

  3. What evaluation metrics would be appropriate?

  4. Are there existing models I could fine-tune rather than build from scratch?

  5. What technical challenges should I anticipate?

  6. How much labeled data would I need for reasonable performance?

I appreciate any insights, suggestions, or references to similar projects. Thank you!


What would be the most effective architecture for such a model?

You can use an LLM such as DeepSeek-R1 for the analysis itself, but you will first need to convert the video and audio information into text. For the audio track, you can use an ASR model such as Whisper. There are many examples that process YouTube videos this way, so those should be helpful references.
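As a rough sketch of that ASR step, the snippet below extracts the audio track with ffmpeg and transcribes it with a Whisper checkpoint via the transformers pipeline. The file names and the choice of `openai/whisper-small` are assumptions, and ffmpeg must be installed:

```python
# Sketch: transcribe a short clip's audio with Whisper via the transformers pipeline.
# Assumes ffmpeg is installed and the clip has already been downloaded to video.mp4.
import subprocess
from transformers import pipeline

# 1. Extract the audio track as 16 kHz mono WAV with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", "video.mp4", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True,
)

# 2. Run automatic speech recognition with a Whisper checkpoint.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("audio.wav", return_timestamps=True)  # timestamps help align text with scenes
print(result["text"])
```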

For turning the visual footage into text, you can use a vision-language model (VLM) that supports video input. A VLM can generate descriptions on its own to a certain extent, but models that handle video directly are still relatively rare, so I’m not very familiar with them either.
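Because video-native VLMs are still uneven, one common workaround (not the only option) is to sample frames and caption each one with an image VLM, producing a rough textual storyboard. The sketch below uses OpenCV for frame sampling and the BLIP captioning checkpoint as an example; both choices are assumptions you could swap for a proper video VLM:

```python
# Sketch: describe a video by sampling frames and captioning each with an image VLM.
# BLIP is used only as an example checkpoint; a video-native VLM could replace this step.
import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def sample_frames(path: str, every_n_seconds: float = 2.0):
    """Yield PIL images sampled every few seconds from the video."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()

captions = [captioner(frame)[0]["generated_text"] for frame in sample_frames("video.mp4")]
print("\n".join(captions))  # a rough textual storyboard of the clip
```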

In any case, I recommend exploring Hugging Face Spaces to collect models you can reuse and to find ready-made components first, then build, adjust, and assemble the missing parts yourself.
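Once you have a transcript and frame descriptions, assembling them into a single evaluation prompt for a reasoning model is straightforward. The sketch below uses an OpenAI-compatible client; the endpoint URL and the `deepseek-reasoner` model name are assumptions you should verify against the provider’s current documentation:

```python
# Sketch: combine the transcript and frame captions into one evaluation prompt
# and send it to a reasoning model. Endpoint and model name are assumptions;
# check the provider's current documentation before using.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",               # placeholder
    base_url="https://api.deepseek.com",  # assumed OpenAI-compatible endpoint
)

transcript = "..."        # output of the ASR step
frame_captions = ["..."]  # output of the frame-captioning step

prompt = (
    "You are a short-video quality reviewer. Evaluate the TikTok video described below "
    "on content theme, visual effects, narrative structure, audience engagement, and "
    "emotional resonance. Give a 1-5 score and a short rationale for each dimension.\n\n"
    f"Transcript:\n{transcript}\n\nFrame descriptions:\n" + "\n".join(frame_captions)
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed model name for DeepSeek-R1
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```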