I have recently done some work on gesture recognition using sensors attached to, for example, gloves. With a defined set of distinct gestures the model works fairly well. However, an idea that came up is whether it would be possible to use pretrained “general knowledge” models to also predict other gestures. Deep down in, let's say, GPT-2 there might be some knowledge of what a “pointing finger” or a “waving hand” is. With my limited exposure to NLP and transformers: would it be possible to fine-tune a pretrained model so that it outputs some semantic representation of the gesture?
The question is broad, so I will try to break it down as far as I have thought it through:
The input data is simply the numerical values from the sensors (a fixed-size float vector, possibly as a sequence over time). The first step of using e.g. GPT-2 would be to discard its textual tokenization and embedding step. I would call this an input domain shift, and any pointers or discussion about it would be welcome; I have yet to find anything with my google-fu. One approach would perhaps simply be to feed the sensor data to the model directly.
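To make the "feed the sensor data directly" idea concrete, here is a minimal sketch of what I have in mind. It assumes the Hugging Face `transformers` GPT-2 implementation, whose forward pass accepts precomputed embeddings through `inputs_embeds` instead of token ids. The sensor dimension, the `sensor_proj` layer, and the tiny randomly initialised config are my own placeholders so the snippet runs without downloading weights; for a real experiment one would load the pretrained model with `GPT2Model.from_pretrained("gpt2")`.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model

SENSOR_DIM = 16  # hypothetical number of sensor channels per time step

# Tiny randomly initialised GPT-2 so the sketch runs offline.
# For real use: backbone = GPT2Model.from_pretrained("gpt2")
config = GPT2Config(n_embd=64, n_layer=2, n_head=2, n_positions=128)
backbone = GPT2Model(config)
backbone.eval()

# Learned projection that takes the place of the token-embedding lookup.
sensor_proj = nn.Linear(SENSOR_DIM, config.n_embd)

# Fake batch: 4 recordings, 30 time steps each.
sensors = torch.randn(4, 30, SENSOR_DIM)

with torch.no_grad():
    embeds = sensor_proj(sensors)  # (4, 30, 64)
    # Bypass tokenization entirely; position embeddings are still added inside.
    hidden = backbone(inputs_embeds=embeds).last_hidden_state

print(hidden.shape)
```

Whether a plain linear projection is enough to bridge the gap between sensor values and the embedding space the model was trained on is exactly the part I am unsure about.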
The encoder/decoder steps of the model could perhaps work as is. Fine-tuning these steps slowly (for example with a much lower learning rate than the newly added layers) is probably important so that the general knowledge is preserved.
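One way I imagine "slow fine-tuning" could look is with per-parameter-group learning rates in PyTorch: the new input and output layers train quickly while the pretrained part moves only a little. The sketch below is an assumption on my side, not a tested recipe. `GestureModel`, its layer sizes, and the learning rates are hypothetical, and a small `nn.TransformerEncoder` stands in for the pretrained backbone so the snippet is self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureModel(nn.Module):
    """Hypothetical wrapper: new input projection + backbone + new output head."""

    def __init__(self, sensor_dim=16, hidden=64, n_labels=10):
        super().__init__()
        self.sensor_proj = nn.Linear(sensor_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=2, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for GPT-2
        self.head = nn.Linear(hidden, n_labels)

    def forward(self, x):
        h = self.backbone(self.sensor_proj(x))
        return self.head(h[:, -1])  # classify from the last time step

model = GestureModel()

# New layers get a normal learning rate, the backbone a much smaller one.
optimizer = torch.optim.AdamW([
    {"params": model.sensor_proj.parameters(), "lr": 1e-3},
    {"params": model.head.parameters(), "lr": 1e-3},
    {"params": model.backbone.parameters(), "lr": 1e-5},
])

# One training step on random data, just to show the mechanics.
x = torch.randn(4, 30, 16)
y = torch.randint(0, 10, (4,))
logits = model(x)
loss = F.cross_entropy(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

A stricter variant would freeze the backbone completely at first with `model.backbone.requires_grad_(False)` and only unfreeze it later.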
The output of the model could probably come in many different forms. I think the most interesting output would be something like a summary of the gesture (e.g. a few tokens). However, I have some trouble thinking of how to define the labels during training. When recording gestures for the training data it is easy to come up with many different words for a single gesture (e.g. “victory” or “2” for an extended index and middle finger). Would it be possible to combine several labels into one label? A first step could also simply be a single label per gesture, just to see “if it works”.
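For the "several labels for one gesture" question, one option I can think of is to treat it as multi-label classification: each gesture gets a multi-hot target over a small word vocabulary, trained with a binary cross-entropy loss. The sketch below only shows that mechanism. The vocabulary, the annotations, and the random `logits` standing in for model outputs are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical label vocabulary; several words may describe the same gesture.
vocab = ["victory", "2", "peace", "pointing", "waving"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Hypothetical annotations, one list of acceptable words per recording.
annotations = [
    ["victory", "2"],  # extended index and middle finger
    ["pointing"],
    ["waving"],
]

def multi_hot(words):
    """Return a vector with 1.0 at every acceptable word, 0.0 elsewhere."""
    target = torch.zeros(len(vocab))
    for w in words:
        target[word_to_idx[w]] = 1.0
    return target

targets = torch.stack([multi_hot(a) for a in annotations])  # shape (3, 5)

# Stand-in for the model output: one logit per vocabulary word.
logits = torch.randn(3, len(vocab))

# Each word is an independent yes/no decision, so several can be "on" at once.
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(targets)
```

This sidesteps text generation entirely, so it is closer to the single-label "see if it works" step than to a free-form token summary.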
There are many different NLP tasks, and models are generally suited to a specific task. Would GPT-2 be usable to, for example, output a small set of tokens, or are other models perhaps better suited?
I would love to have a discussion about this approach and also be pointed to resources that I have (surely) missed.