Video Classification

Hi everyone,
I am starting to look into the task of classifying videos, trying to understand what approaches are currently available.

Naively speaking, I guess one could sample N frames from a video (uniformly rather than randomly, probably), classify each of them, and then aggregate the per-frame predictions (most frequent prediction, most confident prediction, etc.). This may be reasonable for simple classification tasks (e.g. is there a cat in this video? Is the video set indoors or outdoors?).
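To make the naive baseline concrete, here is a rough sketch of the two pieces that do not depend on any particular model: picking evenly spaced frame indices and aggregating per-frame predictions. The function names are my own, and the `(label, confidence)` pairs are made up for illustration; in practice they would come from whatever image classifier you run on each sampled frame.

```python
from collections import Counter


def uniform_frame_indices(num_frames, n):
    """Return up to n frame indices spread evenly over a video of num_frames frames."""
    if n <= 0 or num_frames <= 0:
        return []
    n = min(n, num_frames)
    # Split the video into n equal segments and take the middle frame of each.
    return [int((i + 0.5) * num_frames / n) for i in range(n)]


def majority_vote(frame_preds):
    """Most frequent label among (label, confidence) pairs."""
    counts = Counter(label for label, _ in frame_preds)
    return counts.most_common(1)[0][0]


def most_confident(frame_preds):
    """Label of the single most confident (label, confidence) pair."""
    return max(frame_preds, key=lambda pred: pred[1])[0]


# Hypothetical per-frame outputs from an image classifier (not real results).
frame_preds = [("cat", 0.60), ("dog", 0.99), ("cat", 0.70)]

indices = uniform_frame_indices(num_frames=100, n=4)
video_label_vote = majority_vote(frame_preds)
video_label_conf = most_confident(frame_preds)
```

Note that the two aggregation rules can disagree on the same predictions (here one confident outlier frame wins under `most_confident` but loses the vote), so the choice of rule matters even for this simple baseline.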

On the other hand, this approach would lose any temporal information conveyed by the frame sequence, and it would ignore the sound/speech track entirely. Capturing both would require a multi-modal model that can process sequences.

So I was wondering if any of you can point out examples of models that have been proposed/used for video classification in any of these directions.
I tried browsing the Hugging Face directory but could not find a “video classification” task category, and I have the feeling (after some web searching) that this topic is generally less covered than image or text classification.

Any pointer/suggestion is very much appreciated :pray: