How to feed keypoint data into a transformer?

Hi, I am learning about transformers for images and videos. I would like to know how a sequence of keypoint data (facial and hand landmarks) can be fed into a transformer model. I want to train a transformer model for Sign Language Translation (automatic video-to-text translation).
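
To make the question concrete, here is roughly what I have in mind so far: flatten each frame's landmarks into one feature vector, project it to the model dimension, add positional information over the frame index, and pass the resulting sequence through a standard transformer encoder. This is just a minimal sketch assuming PyTorch, 75 landmarks per frame, and (x, y) coordinates; names like KeypointEncoder and d_model are my own placeholders. Is this a reasonable way to do it, or is there a better-established approach?

```python
import torch
import torch.nn as nn

class KeypointEncoder(nn.Module):
    def __init__(self, num_keypoints=75, coords=2, d_model=256,
                 nhead=8, num_layers=4, max_len=512):
        super().__init__()
        # Flatten each frame's landmarks into one vector and project it
        # to the transformer's model dimension.
        self.input_proj = nn.Linear(num_keypoints * coords, d_model)
        # Learned positional embedding over the frame index.
        self.pos_embed = nn.Embedding(max_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, keypoints):
        # keypoints: (batch, frames, num_keypoints, coords)
        b, t, k, c = keypoints.shape
        x = keypoints.reshape(b, t, k * c)          # one vector per frame
        x = self.input_proj(x)                      # (batch, frames, d_model)
        positions = torch.arange(t, device=x.device)
        x = x + self.pos_embed(positions)           # add frame-order information
        return self.encoder(x)                      # (batch, frames, d_model)

# Example: a batch of 2 clips, 100 frames each, 75 landmarks with (x, y) coords.
dummy = torch.randn(2, 100, 75, 2)
features = KeypointEncoder()(dummy)
print(features.shape)  # torch.Size([2, 100, 256])
```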

I am also looking for efficient keypoint extraction models that can run on a CPU, to preprocess images and videos for dataset creation.
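
For context, the output format I am aiming for in preprocessing is one array of shape (num_frames, num_keypoints, 2) per clip, along the lines of the sketch below. Here extract_keypoints is a placeholder for whatever CPU-friendly extractor I end up using, not a real library call.

```python
import numpy as np
import cv2  # assumes OpenCV is available for video decoding

def extract_keypoints(frame):
    """Placeholder: return an array of shape (num_keypoints, 2) for one frame."""
    raise NotImplementedError

def video_to_keypoint_sequence(video_path):
    # Decode the video frame by frame and run the landmark extractor on each frame.
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(extract_keypoints(frame))
    cap.release()
    # Shape: (num_frames, num_keypoints, 2), ready to feed the encoder above.
    return np.stack(frames)
```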