# 'Simple' regression transformer

How do I make a ‘generic’ transformer model for guessing a point in the middle of a sequence? More specifically:

Assume I have a sequence of 2N+1 vectors. Masking the middle vector, I would like to train a model to predict it using the context vectors on both sides (so, 2N context vectors in total).
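For concreteness, the setup described above can be sketched like this (the dimensions and the learned mask embedding are assumptions for illustration, not part of the question):

```python
import torch

# Hypothetical shapes: N context vectors on each side, feature dimension D.
N, D = 4, 16
seq = torch.randn(2 * N + 1, D)   # the 2N+1 input vectors
target = seq[N].clone()           # the middle vector is the regression target

# Replace the middle vector with a learned [MASK] embedding
# (initialized to zeros here), analogous to BERT's [MASK] token.
mask_embedding = torch.nn.Parameter(torch.zeros(D))
masked_seq = seq.clone()
masked_seq[N] = mask_embedding
```

The model then sees `masked_seq` and is trained so its output at position N matches `target`.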

This seems like a pretty straightforward thing to do with transformers, but when I dig into the documentation, it seems all the examples assume classification of individual vectors?

I apologize if this is already covered in some tutorial, I couldn’t find a likely one when I searched.

Not sure if I understand the question correctly, but if you want to do regression/classification, what you usually have to do is add a layer on top of BERT (a Dense layer, for example, in TensorFlow). Then you train the model (either only the Dense layer, or both the Dense layer and BERT). Make sure you don't add an activation function to the Dense layer, because you need regression.
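A minimal sketch of that head in PyTorch (the hidden size of 768 is BERT-base's; the output dimension of 16 is an assumed example):

```python
import torch
import torch.nn as nn

hidden_size, out_dim = 768, 16  # BERT-base hidden size; assumed target dim

# Regression head: a single linear layer with no activation, so the
# output is unbounded and suitable for an MSE-style regression loss.
regression_head = nn.Linear(hidden_size, out_dim)

pooled = torch.randn(8, hidden_size)    # stand-in for a batch of BERT outputs
prediction = regression_head(pooled)    # shape (8, out_dim)
```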

I want the target to be the masked vector in the input. I suspect I will have to retrain the model from scratch (the input vectors have nothing in common with words).

So, your suggestion is to add a fully connected layer on top of a BERT model, with the same output dimension as the dimensionality of my input vectors? I was hoping I could use the fact that my task here is exactly one of the subtasks that BERT is trained on anyway?

Well, the way that predicting masked tokens works is feeding the BERT embedding to a neural network which outputs probabilities over all the words in the vocabulary (a classic softmax). After training, that last network is dropped from the model (as far as I know) and only the layers before it remain.
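That masked-language-model head can be sketched like this (768 and 30522 are BERT-base's hidden size and vocabulary size; the single hidden state is a stand-in):

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522  # BERT-base values

# During MLM pretraining, the hidden state at each masked position is
# projected to vocabulary logits and softmaxed. This head is what gets
# discarded after pretraining.
mlm_head = nn.Linear(hidden_size, vocab_size)
hidden = torch.randn(1, hidden_size)           # stand-in for one masked position
probs = torch.softmax(mlm_head(hidden), dim=-1)  # probabilities over the vocab
```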

So, in your case, you should add a fully connected layer whose output dimension equals the dimensionality of your input vectors (as you said), and apply some loss (MSE, for example, depending on your vectors). The input could be the concatenation of your left and right context vectors. This is my first thought; there may be a more appropriate solution.
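Putting the pieces together, here is one way the whole thing could look as a small bidirectional encoder with a linear regression head and an MSE loss. Everything here (the architecture, all dimensions, the learned mask and position embeddings, and training from scratch) is an assumed sketch, not a prescription:

```python
import torch
import torch.nn as nn

class MiddleVectorRegressor(nn.Module):
    """Bidirectional encoder that predicts the masked middle vector.

    Assumed setup: inputs are (batch, 2N+1, in_dim); the middle position
    is replaced internally by a learned mask embedding.
    """

    def __init__(self, in_dim=16, d_model=64, nhead=4, nlayers=2, seq_len=9):
        super().__init__()
        self.mask_embedding = nn.Parameter(torch.zeros(in_dim))
        self.input_proj = nn.Linear(in_dim, d_model)
        self.pos_embedding = nn.Parameter(torch.zeros(seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, in_dim)  # regression head, no activation

    def forward(self, seq):
        mid = seq.shape[1] // 2
        x = seq.clone()
        x[:, mid] = self.mask_embedding          # mask the middle vector
        h = self.encoder(self.input_proj(x) + self.pos_embedding)
        return self.head(h[:, mid])              # predict only the middle slot

model = MiddleVectorRegressor()
seq = torch.randn(8, 9, 16)                      # batch of 2N+1 = 9 vectors
pred = model(seq)
loss = nn.functional.mse_loss(pred, seq[:, 4])   # MSE against the true middle
loss.backward()
```

Since attention sees both sides of the masked position, this uses the left and right context jointly rather than concatenating them by hand.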

Anyway, you don’t have to retrain BERT from scratch. You only have to train the regression head you add on top of it.
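Training only the head means freezing the backbone's parameters so only the head receives gradients. A minimal sketch, with a plain linear layer standing in for the pretrained encoder:

```python
import torch.nn as nn

backbone = nn.Linear(16, 768)   # stand-in for a pretrained encoder
head = nn.Linear(768, 16)       # the new regression head

# Freeze the backbone so the optimizer only updates the head.
for p in backbone.parameters():
    p.requires_grad = False

trainable = [p for p in head.parameters() if p.requires_grad]
```

The optimizer would then be built over `trainable` only.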
