Finetuning GPT2 with user defined loss

@aclifton314 yup, still the case that the gradients won’t flow through the sampling line. Check out this this post

@chrisdoyleIE, if I follow the post it boils down to the sampling methods being non-differentiable. Is this correct? If so, I would wonder about ReLU since it is not differentiable at 0. Would that seem to indicate, then, that it would be sufficient for the functions involved in backpropagation to be differentiable in the neighborhood of the current value of the weight?

I glanced back over the GPT2 and “Attention is all you need” papers and it appears that the decoder used in both papers utilizes something like argmax to sample from the probability distribution to generate the next token for calculating the loss. Maybe I am missing something, but that would seem to indicate that they would run into the same issue raised here.

I’m sure I’m missing something so feel free to walk us through it, as you have studied this material in various contexts.

Paragraph 1:

Correct - sampling is non-differentiable. Regarding ReLU, discontinuities such as the one found at 0 can be avoided in the function’s implementation (so long as operators used are good with PyTorch’s autograd):

# source:
class ReluLayer(Layer):
    """Layer implementing an element-wise rectified linear transformation."""

    def fprop(self, inputs):
        """Forward propagates activations through the layer transformation.
        For inputs `x` and outputs `y` this corresponds to `y = max(0, x)`.
            inputs: Array of layer inputs of shape (batch_size, input_dim).
            outputs: Array of layer outputs of shape (batch_size, output_dim).
        return np.maximum(inputs, 0.)

    def bprop(self, inputs, outputs, grads_wrt_outputs):
        """Back propagates gradients through a layer.
        Given gradients with respect to the outputs of the layer calculates the
        gradients with respect to the layer inputs.
            inputs: Array of layer inputs of shape (batch_size, input_dim).
            outputs: Array of layer outputs calculated in forward pass of
                shape (batch_size, output_dim).
            grads_wrt_outputs: Array of gradients with respect to the layer
                outputs of shape (batch_size, output_dim).
            Array of gradients with respect to the layer inputs of shape
            (batch_size, input_dim).
        return (outputs > 0) * grads_wrt_outputs

    def __repr__(self):
        return 'ReluLayer'


This is a super question and great critical thinking on your part. There are two answers to this - first is the answer you are looking for but second is something else to think about should you manage to implement some differentiable sampling.

Firstly, think about what we need gradients for and when the update is applied. At the first step, the decoder start token is generated from thin air and has no gradient trail, but we don’t need there to be, because by the chain rule, we only need the products from d{loss}/d{output} to d{layer1}/d{input}. Nothing beyond d{layer1}/d{input} is needed to improve the model so we don’t need to compound gradients at each step for each generated token. That is, we never need d{inputs}/d{other stuff} - so it is somewhat suitable that they are 0.

Secondly, let’s consider some wild event where the legacy gradients were causing a problem i.e. sampling is differentiable - how could we remove them? Most implementations (including Vaswani et al 2017 as far as I know, but am open to correction) use an algorithm called teacher forcing which inputs the previous label rather than the previously generated token into the decoder. As with the initial decoder token, this vector can be generated from thin air before being inputted and doesn’t have a gradient.

Finally, I am only a simple postgrad student so please @others correct me if anything seems off…

1 Like

I don’t think I fully understand the response for paragraph 2. If we consider a neural net with:

E: \text{loss function} \\ o_{j} = \phi(net_{j}) = \phi(\sum_{k=1}^{n}w_{kj}o_{j}) \\ \phi: \text{activation function}

Then we can use backpropagation to calculate the derivative of the error wrt the weights of the network:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}}\frac{\partial o_{j}}{\partial net_{j}}\frac{\partial net_{j}}{\partial w_{ij}}

This requires the activation function \phi to be differentiable. If I pass in a sentence through the decoder network (i.e. GPT2), following what is done in Vaswani, I can pass the output of the decoder network through a learned liner transformation to expand the final representation over my vocabulary, those logits can be passed through a softmax to get the probability distribution over my vocab.

At this point, a decision must be made about how to choose the next token from this probability distribution. Let’s just suppose that a choice is made somehow. That choice picks the candidate for the next token which can then be passed to the loss function.

But there are no weights associated with the probability distribution nor how the probability distribution was generated (via softmax). I would think you would have to start the backpropagation at the loss function and go back to the linear transformation (which is just after the last decoder layer).

Maybe this picture would be different if we incorporated sampling into the loss function explicitly. But using sampling as a means to chose (even if that were via cosmic ray interference or whichever number came to a toddlers mind) doesn’t affect the gradients in an obvious way, at least not obvious to me.

I’m enjoying this conversation and it’s pushing my understanding :grinning:

All correct, but I feel you may have missed my point (though I re-read my comment and it wasn’t terribly clear).

First: activation function does not have to be continuous in for brop to work in practice, it can be implemented as it is in my previous reply, such that we account for these discontinuities and assign gradients as demonstrated in the code.

Second: yes exactly, sampling via softmax has no weights and is not differentiable. This means we cannot pass gradients through this operation so all losses must be calculated before this step - so your seqeunce-level loss will not work as by the chain rule product you defined above; all dE/dwi will be 0.

If you use an RL agent to sample, then this agent’s policy or action network is differentiable, and so sampling becomes differentiable. If sampling becomes differentiable, then our chain rule goes back through the model right until it isn’t, i.e. back through time. This can be prevented with teacher forcing.

@chrisdoyleIE Thanks for your reply. I think I am starting to understand what you’ve been saying.

Just thinking out loud here: I take what I described above but stop at the softmax (GPT2 -> linear -> softmax). As you’ve pointed out, there are no weights associated with the softmax so the gradients are zero. Is there a way to let pytorch or HF know that I don’t need to calculate gradients for layers that have no weights? Maybe I’m not understanding how pytorch does it’s backpropagation.

As I understand, you need some way to sample a probability distribution to generate a new token for a text generation model. I’m not 100% sure, but it looks like GPT2 is doing something like argmax to generate that next token when it was pretrained. But argmax is not differentiable and thus the loss would not be optimized since the gradients would be zero from that point on.

Is it the case that GPT2 was nominally trained with a RL agent to do their sampling of the next token?

I was trying to fine-tune a GPT2LMHeadModel via Trainer and a Dataset of text, but kept getting issues w.r.t. the tokenizer not being applied to the text, which messed with the data collator. This helped me out a lot, a solution is to encode the text first and have the __getitem__ return a dict with the encoded text under input_ids as you showed here.

HI @aclifton314 , sorry, I want to define some extra loss for gpt 2, in this code logits are the expected labels means that the actual output is logits that I can compare with labels and define new loss function?

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)


        outputs = model(  b_input_ids,
                          attention_mask = b_masks,
        loss, logits = outputs[:2]

@SUNM As I understand reading your code sample, the logits will be the unnormalized values of model. There might be use for using logits directly in a newly defined loss function, however it is my understanding that the actual values of the logits do not make for a good comparison against the set of labels. Namely, logits are allowed to take on negative values and it isn’t immediately clear if a negative logit value constitutes a large mismatch from a label. Here’s an example.

Suppose your labels are (1,2,3,4,5), corresponding to the size of your vocab. The size of the output vector for logits should be 5, but the values inside could be negative, less than one, very negative, very large, etc. The question becomes, for an element in logits with a negative value (say -12.789) and a label of 4, is that prediction markedly off from the label resulting in a large loss?

I would say you need an extra step. One would need to normalize the logits to put them on the same scale to be able to calculate a more accurate loss. In this case, the softmax function provides this. Taking the softmax of the logits vector results in a probability distribution for that example over your vocab. It might look something like:

p_output = softmax(logits) = (0.09, 0.1, 0.87, 0.33, 0.00008)

If this training example had a label of 4, and one were taking the index of the largest value of p_output as the predicted label (in this example it would be a prediction of 2 since 0.87 is the largest value in p_output), one would calculate the loss as:

new_loss = my_new_loss_fn(2, 4)

To summarize, when you go to write your new loss function it would make sense to use the logits as an input to that function. However, you would need to normalize logits with a softmax before comparing to the labels. Does that make sense? Let me know if something is unclear. I’m happy to help.

1 Like

Hi @aclifton314 , many many thanks for your reply my lable are like b_labels tensor([[50257, 1169, 8224, …, 50258, 50258, 50258],
[50257, 1169, 8224, …, 50258, 50258, 50258],
[50257, 1169, 8224, …, 50258, 50258, 50258],
[50257, 1169, 8224, …, 50258, 50258, 50258]]) with size of (4,400) and the logits as you said yes can be negative as well with size of (4,400,50259) for 4 batch size of training.
[[-1.1065e+02, -1.0942e+02, -1.1557e+02, …, -1.1143e+02,
4.4173e+00, 2.0057e-01],
[-5.8343e+01, -5.5646e+01, -5.7750e+01, …, -5.7685e+01,
3.9150e+00, -8.2319e-01],
[-1.0945e+02, -1.0816e+02, -1.1262e+02, …, -1.0903e+02,
5.1158e+00, -6.7700e-01],. now there are 2 questions, one is the size of logits and label are not same and second one is for negative values is it good to rescal ethe logits in the range of lables? and then compared them?

@SUNM thanks for sending that detail.

Question 1
I would expect that the size of b_labels and logits would not be the same, but this is ok. I would also guess that 50,259 is the size of the vocabulary used and 400 is something like the context window (or the number of tokens that can be fed into the model at one time). How I am understanding this set up is for the context size of 400, b_labels is providing indexes of tokens from the vocabulary as the labels. Taking a look at just a single example like [50257, 1169, 8224, …, 50258, 50258, 50258] my guess is that 1169 and 8224 might correspond to particular words in the vocab.

logits, on the other hand gives the model predictions for each word in the context size over the entire vocabulary. Again, considering the first example in logits, the size of the tensor would be (400, 50259). I take this to mean that the model provided unormalized scores for the first word in the context window for each word in the vocab.

Question 2
From the above, in order to calculate the loss one would need to somehow choose which word in the vocab to use. Here is where normalizing the logits makes sense. The first example in logits has shape (400, 50259), but technically those values span from negative infinity to positive infinity. Which logit score is the best? One answer is to take the softmax of the logit values to place them in the 0 to 1 range (resulting in a probability distribution). That resulting vector is then the probability distribution of the model predictions over the vocabulary for each word in the context window and as such, will have the same size of (400, 50259).

You could then decide which prediction to use in the loss by looking at the index of the softmaxed vector that has the largest value. That position in the vector corresponds to a word in the vocab. As an example, consider the first row of the softmaxed tensor (it has size (1,50259)). It has a largest element. The index of that largest element corresponds to the same index in the vocabulary that the model believes is the most probable answer.

Once you gather all of those indices from the tensor for each example in the batch, the shape of the resulting tensor will be (4,400) and then you can feed that along with the b_labels tensor into your loss function.

I hope that wasn’t too unclear. If it doesn’t make sense, I can provide a concrete sample example to illustrate the concept.

many thanks. very nice. I understand what you mean :). I will apply it and tell u the results.excuse me, do you know how to apply trainer for multiple GPUS? I think it should be some set up inside the code. do you have any samples which works? I read alot and get confused with different strategies

@SUNM It is my understanding that if you train a model with the huggingface Trainer class (as you’ve shown in the post Finetuning GPT2 using Multiple GPU and Trainer) and you have multiple GPUs available on your system, then Trainer will use all of the available GPUs unless explicitly told not to do so.

So, if you have 4 GPUs available but you set the environment variable CUDA_VISIBLE_DEVICES=“1,2,3”, then Trainer will only utilize those 3 GPU instead of all 4 of them.

1 Like

HI @aclifton314 , my model is not trained well. After 16 epoch when I pass the input it gave me bunch of stories which is not in my dataset. it is clear that it is not trained well. do you have any idea? is it possible for you to share the code that the results had make sense to you with Trainer? Indeed I pass a complaint to the gpt and I ask it to create solition for that complaint but it create storiesssss :slight_smile:

@SUNM GPT2 may not be the best model for your application, but before we rule it out would you mind explaining to me what your task is and what you are expecting the model to do?

Hi @aclifton314 , many many thanks for your reply. I really appreciate. the challenge is that I have some problems and for each problem there is a suggested solution. I join the problem and related solution together as input. and fine tune the gpt and then I expect that after fine tunning for the given problem it should generate a good solution. but the solution is not that accurate now. it generates but it is not very related. any idea?