Hi, I am learning about transformers for images and videos. I would like to know how a sequence of keypoint data (facial and hand landmarks) can be fed into a transformer model. My goal is to train a transformer for sign language translation (automatic video-to-text translation).
I am also looking for efficient keypoint extraction models that can run on a CPU, which I could use to preprocess images and videos when creating the dataset.