Advice on model design: fine-tuning a model to output text given numerical values

I have recently done some work on gesture recognition using sensors attached to, e.g., gloves. With a defined set of distinct gestures the model works fairly well. However, an idea that sprang up is whether it would be possible to use pretrained “general knowledge” models to also predict other gestures. Deep down in, let’s say, GPT-2 there might be some knowledge of what a “pointing finger” or a “waving hand” is. With my limited exposure to NLP and transformers: would it be possible to fine-tune a pretrained model so that it tells us some semantic representation of the gesture?

The question is broad and I will try to break it down as far as I have thought it through:

  1. The input data is simply the numerical values (a fixed-size float vector) from the sensors (possibly in a sequence). The first step of using e.g. GPT-2 would be to discard the initial textual tokenization and embedding step. I would say that this is an input domain shift, and any pointers/discussion about this would be welcome; I have yet to find anything with my google-fu. One approach would perhaps simply be to feed the sensor data to the model directly.

  2. The encoder/decoder blocks of the model could perhaps work as is. Fine-tuning these slowly, so that the general knowledge is preserved, is probably important.

  3. The output of the model could probably come in many different forms. I think the most interesting output would be something like a summarization of the gesture (e.g. a few tokens). However, I have some trouble thinking of how to define the labels during training. When recording gestures for the training data it is easy to come up with many different words for a single gesture (e.g. “victory” or “2” for a stretched index and middle finger). Would it be possible to combine several labels into one label? A first step could also simply be a single label, just to see “if it works”.

  4. There are many different NLP tasks, and the :hugs: models are generally suited to a specific task. Would GPT-2 be usable to, for example, output a small set of tokens, or are other models perhaps better suited?
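Points 1 and 2 above can be sketched together. This is a minimal, hypothetical setup (the sensor count, layer sizes, and variable names are all made up for illustration): a learned linear projection replaces the tokenization/embedding step by mapping each sensor frame into the model’s hidden size, and the pretrained blocks are frozen so their general knowledge is preserved while only the new projection trains.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Hypothetical sensor setup: 22 joint angles per time step.
N_SENSORS = 22

# A small randomly initialised GPT-2 keeps this sketch self-contained;
# for the real experiment you would load the pretrained weights with
# GPT2Model.from_pretrained("gpt2").
config = GPT2Config(n_layer=2, n_head=4, n_embd=64)
body = GPT2Model(config)

# Point 1: a linear projection from sensor space into the hidden size
# stands in for the discarded textual tokenization/embedding step.
sensor_proj = torch.nn.Linear(N_SENSORS, config.n_embd)

# Point 2: freeze the transformer blocks so the "general knowledge" is
# preserved; train only the projection first, and optionally unfreeze
# the body later with a small learning rate.
for p in body.parameters():
    p.requires_grad = False

# A batch of 4 gesture sequences, 16 time steps each.
sensors = torch.randn(4, 16, N_SENSORS)
hidden = sensor_proj(sensors)                       # (4, 16, n_embd)
out = body(inputs_embeds=hidden).last_hidden_state  # (4, 16, n_embd)
print(out.shape)
```

The key mechanism is `inputs_embeds`, which the Hugging Face GPT-2 models accept in place of token ids, so no change to the model code itself is needed.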

I would love to have a discussion about this approach and also be pointed to resources that I have (surely) missed.

Hi johank.

(I am not an expert in any of these areas.)

I don’t think GPT-2 has any knowledge of anything “Deep down”. The way the model works is only probabilistic. It doesn’t automatically “know” even simple things like sizes. If you ask it how many pints to a gallon, it might be able to tell you, but it might also generate a paragraph that implies that a pint is bigger than a gallon, without “realising” that it should check for that kind of error.

I suppose, if GPT2 has seen enough descriptions of a “pointing finger” it might be able to associate a description with the label, but I don’t think that’s what you are after.

There is almost certainly a better understanding of “pointing finger” inside your head than in GPT2. Although you are having trouble thinking of how to define the labels during training, I think you would be better at it than GPT2.

If you “discard the first textual tokenization and embedding step”, then the whole trained GPT2 effectively becomes untrained.

When people make a “victory” gesture they mean something completely different to a “2” gesture. If GPT2 “knows” about either or both of those gestures, it will “know” about them as completely different things. It is unlikely that GPT2 will ever have been told that the two physical manifestations are similar.

When you say “gesture”, do you mean a single hand shape, or is the motion important?

I would be interested to see whether a neural network could learn to distinguish between “victory” and “2”. Obviously, it can only learn to distinguish them if the training data has something different about them. I imagine it might have, particularly if your data includes motion and not merely shape.

GPT2 might possibly have some memory about what other text is commonly found in the vicinity of the words “pointing finger”.

You could test out what gestures GPT2 already “knows” about, by feeding it some starter text such as “He made a gesture of …” and looking at the probabilities for the next words.
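That probe could look something like the sketch below. To keep it runnable without downloading anything, it uses a small randomly initialised model and stand-in token ids, so the probabilities here are meaningless; swapping in the pretrained weights and a real tokenizer (as indicated in the comments) gives the actual next-word distribution for a prompt like “He made a gesture of”.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# For a real probe, load the pretrained weights and tokenizer instead:
#   from transformers import GPT2Tokenizer
#   tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
#   model = GPT2LMHeadModel.from_pretrained("gpt2")
#   ids = tokenizer("He made a gesture of", return_tensors="pt").input_ids
model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=4, n_embd=64))
model.eval()
ids = torch.tensor([[31373, 995, 318]])  # stand-in token ids

with torch.no_grad():
    logits = model(ids).logits           # (1, seq_len, vocab_size)
probs = logits[0, -1].softmax(dim=-1)    # distribution over the next token
top = probs.topk(5)
print(top.indices.tolist())              # with real weights, decode these ids
```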

GPT2 is very (very) large, and would need a lot of time and a lot of data to train it. I think a smaller model would be more suitable.

My guess is that you don’t want a pre-trained text model at all. I could be wrong about that.

Hello and thank you for the response.

When I say “Deep down” I mean what you say: that it might have some associations between nearby gestures or, in our case, unseen gestures.

I was perhaps a bit unclear, but the main goal is some kind of model that, given a set of training gestures, could predict new gestures. It is fairly straightforward to do classification, which we have successfully done on a small set of gestures, but if we want to predict “general” or “new” gestures we need a different methodology. The large pretrained textual transformers would be one interesting approach.

An example could be that we fine-tune on the gestures 0 to 9 and see if the model recognises when we hold up all ten fingers. If we were to get, in any form (a paragraph or not), an output that even alludes to 10, it would be a great start. And I do not think it is completely unreasonable.

Gestures in our case can include the motion. What we read from our gloves is the relative angles of all joints. If we pack these in a sequence the motion is also captured. A start would be static gestures from a single hand.

Regarding the point of discarding the embedding step, I did a quick check with GPT2LMHeadModel in :hugs: and the following output was given by torchinfo on “gpt2” (i.e. the smallest version):

GPT2LMHeadModel –
├─GPT2Model: 1-1 –
│ └─Embedding: 2-1 38,597,376
│ └─Embedding: 2-2 786,432
│ └─Dropout: 2-3 –
│ └─ModuleList: 2-4 –
│ │ └─GPT2Block: 3-1 7,087,872
│ │ └─GPT2Block: 3-2 7,087,872
│ │ └─GPT2Block: 3-3 7,087,872
│ │ └─GPT2Block: 3-4 7,087,872
│ │ └─GPT2Block: 3-5 7,087,872
│ │ └─GPT2Block: 3-6 7,087,872
│ │ └─GPT2Block: 3-7 7,087,872
│ │ └─GPT2Block: 3-8 7,087,872
│ │ └─GPT2Block: 3-9 7,087,872
│ │ └─GPT2Block: 3-10 7,087,872
│ │ └─GPT2Block: 3-11 7,087,872
│ │ └─GPT2Block: 3-12 7,087,872
│ └─LayerNorm: 2-5 1,536
├─Linear: 1-2 38,597,376

Total params: 163,037,184

The embedding parameters are a large part of the model, and you are probably correct that removing them might throw away all the knowledge in the blocks. Compared to e.g. changing the last linear layer, the embedding layer might be necessary to make use of what is stored in the blocks. However, I would be very interested in seeing if there are other similar approaches.
That said, it is clear that the embedding layer is by no means the largest part of the model.
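The numbers in the torchinfo summary can be checked directly from GPT-2 small’s dimensions (50,257-token vocabulary, 1,024 positions, hidden size 768, 12 blocks) with plain arithmetic, no library needed:

```python
# GPT-2 "small" dimensions (the Hugging Face GPT2Config defaults).
vocab, n_pos, d, n_layer = 50257, 1024, 768, 12

wte = vocab * d          # token embedding: 38,597,376
wpe = n_pos * d          # position embedding: 786,432

# One transformer block: two LayerNorms, attention (c_attn + c_proj)
# and the MLP (c_fc + c_proj), each layer with weights and biases.
block = (2 * 2 * d                 # ln_1 and ln_2 (weight + bias each)
         + d * 3 * d + 3 * d       # c_attn
         + d * d + d               # attention c_proj
         + d * 4 * d + 4 * d       # mlp c_fc
         + 4 * d * d + d)          # mlp c_proj
ln_f = 2 * d
lm_head = vocab * d      # listed separately by torchinfo

total = wte + wpe + n_layer * block + ln_f + lm_head
print(block, total)      # 7,087,872 per block; 163,037,184 in total
```

Worth noting: the lm_head weight is tied to the token embedding, so although torchinfo reports 163M, the model only has 124,439,808 distinct parameters; the embeddings-plus-head share of the summary is therefore somewhat inflated.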

I think a pretrained text model is necessary for the “deep down” knowledge. However, it is possible that there are other approaches. I have just become aware of zero-shot learning, which I will take a look at. But using e.g. GPT-2 would still be very interesting (and very much in line with other awesome usages).