[Announcement] Generation: Get probabilities for generated output

@vblagoje did you ever finish this?

No, I haven’t! I’ll come back to it eventually :slight_smile:

How would one compute the probability of the entire generated sequence given the input when using top-p sampling? This means we would be setting do_sample=True, with top_p set to some value between 0 and 1.0 (and possibly top_k to an integer) in generate().


Good point
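For anyone landing here with the same question, one way to approach it (a sketch, not an official recipe; the model name and the sampling values below are just placeholders) is to ask generate() to return its scores and then sum the per-token log probabilities from compute_transition_scores:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer(["Today is"], return_tensors="pt")
outputs = model.generate(
    **inputs, do_sample=True, top_p=0.9, top_k=50, max_new_tokens=10,
    return_dict_in_generate=True, output_scores=True,
)

# normalize_logits=True turns the (processed) scores into log probabilities
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

# log P(generated tokens | prompt) = sum of the per-token log probabilities
sequence_log_prob = transition_scores.sum(dim=-1)
print(sequence_log_prob.exp())

One caveat: the scores are taken after the top_p/top_k warpers have set the filtered tokens to -inf, so the resulting probabilities are relative to the truncated distribution, not the full model distribution; for the latter you would need the unprocessed logits (e.g. from a separate forward pass).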


@joaogante I get this error for the flan model “AttributeError: ‘T5ForConditionalGeneration’ object has no attribute ‘compute_transition_scores’”

@SaraAmd at first glance, that error should only occur if your version of transformers is not up to date. If you’re on a recent version (e.g. v4.30) and the issue persists, please open a GitHub issue :slight_smile:

@joaogante Thank you for the reply. I updated the transformers library and it works. I just have one question: when a predicted output (“world”) is made up of sub-tokens, how can we get the final probability for that particular generated output (“world”)? Should we take the mean of the probabilities?

@SaraAmd you should take the product of the sub-word probabilities (or the sum of their log-probabilities) :slight_smile:

If you’d like to dive deeper into why you should do it this way, have a look at this blog post.
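As a tiny, made-up example (the sub-tokens and the numbers below are purely illustrative):

import numpy as np

# Suppose "world" was generated as two sub-tokens, e.g. "wor" + "ld", with these
# per-token log probabilities coming from compute_transition_scores (made-up values)
subtoken_log_probs = np.array([-0.7, -0.2])

word_log_prob = subtoken_log_probs.sum()  # sum of log probabilities
word_prob = np.exp(word_log_prob)         # equal to the product of the probabilities
print(word_log_prob, word_prob)           # -0.9, ~0.41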

Hey @joaogante

Really enjoy the new function and the explanation you give here!

I have a silly question, as I want to dig a bit deeper into the implementation details; let me know if my question makes sense. The scores generated are conditional on the previous tokens (let’s not worry about normalization), i.e. log(P(y_i | y_1, …, y_{i-1}, x)). This means that in order to generate the score or logits for token[i], we need to feed in all the previous tokens token[:i].
My question is: how is this implemented? Doesn’t it mean that the time complexity for computing the probabilities is linear in the number of tokens?

To make it more concrete, let’s compare two inputs:

  • Seq1 = “Today is a nice day,”

  • Seq2 = “Today is a nice day, I played soccer with Messi in World Cup.”

For each of them, I pass it to the model and measure the forward-pass time:

import time
import torch

inputs = tokenizer(["Today is a nice day, I played soccer with Messi in World Cup."], return_tensors="pt")
print(inputs.input_ids)

start_time = time.time()
with torch.no_grad():
    outputs = model(inputs.input_ids)  # single forward pass over the whole sequence
end_time = time.time()

execution_time = end_time - start_time
print(execution_time)

Here are the outputs I get from running this on Seq1 vs Seq2:

  • Seq1 : “Today is a nice day,”
    Input Tensor: tensor([[8888, 318, 257, 3621, 1110, 11]])
    Output outputs.logits.shape returns torch.Size([1, 6, 50257])
    run time: 0.16553473472595215
  • Seq2 : “Today is a nice day, I played soccer with Messi in World Cup.”
    Input Tensor: tensor([[ 8888, 318, 257, 3621, 1110, 11, 314, 2826, 11783, 351, 36128, 287, 2159, 5454, 13]])
    Output outputs.logits.shape returns torch.Size([1, 15, 50257])
    run time: 0.18927574157714844

I understand that the 1 in [1, 15, 50257] refers to the batch size, 50257 is the vocab size, and the middle dimension is the length of the input tokens; the output at each position is the conditional probability distribution for the next token over the entire vocab.

What I am curious about is whether there is some sort of parallelization applied within the model. Otherwise, the conditional probabilities could only be computed by going through the tokens one after another (like a for loop). Maybe I am wrong; I would love to learn how this is implemented. In particular, is there a for loop over the sequence so that outputs.logits ends up with shape torch.Size([1, 15, 50257]), or is it parallelized? I would really appreciate it if you could point me to the implementation docs.

[The reason I am asking is that the run time for Seq1 (0.166s) vs Seq2 (0.189s) seems too close for a sequential loop, which makes me suspect there is some parallelization.]

Thank you in advance!
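For reference, here is a small sketch (assuming GPT-2, which matches the token IDs above) illustrating why the two run times are so close: thanks to the causal attention mask, the model scores every position in a single forward pass, so logits[:, i, :] depends only on the first i + 1 tokens and no Python-level loop over the sequence is needed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

full = tokenizer(["Today is a nice day, I played soccer with Messi in World Cup."], return_tensors="pt")
prefix = tokenizer(["Today is a nice day,"], return_tensors="pt")

with torch.no_grad():
    full_logits = model(full.input_ids).logits      # shape [1, 15, 50257]
    prefix_logits = model(prefix.input_ids).logits  # shape [1, 6, 50257]

# The first 6 positions of the full pass match the prefix-only pass: all positions
# are computed in parallel, each attending only to the tokens before it
print(torch.allclose(full_logits[:, :6], prefix_logits, atol=1e-4))  # True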

Hi @joaogante, since you said the scores are the “log probabilities”, I was wondering how I can get the “raw logits” without any transformation applied. (I know I could just exp() the score returned by the .generate() function, but I was wondering if there’s any flag/parameter that I could pass.)

Update: looks like it’s an active issue getting raw logits from .generate() (github)

@khalidsaifullaah - As @joaogante mentions, the scores are actually UNNORMALIZED log probabilities, or what is often also referred to as the model logits / raw logits. I’ve always found the term “unnormalized log probabilities” a bit confusing, since it is easily mixed up with (normalized) log probabilities. Here is a decent post discussing where this naming comes from. Essentially, it comes from the softmax function that we use to get normalized probabilities, which exponentiates and then normalizes: the inputs to the softmax (the logits) are these “unnormalized log probabilities”.

But anyway, in some sense the scores are the raw logits, with the caveat that they are the raw logits AFTER being updated by the logits processors, which may re-weight certain logits or set them to -inf through parameters like repetition_penalty, top_p, top_k, …
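A tiny sketch of the relationship, in case the terminology is confusing (the numbers are arbitrary):

import torch

logits = torch.tensor([2.0, 1.0, 0.1])         # "unnormalized log probabilities"
probs = torch.softmax(logits, dim=-1)          # exponentiate, then normalize
log_probs = torch.log_softmax(logits, dim=-1)  # logits minus logsumexp(logits)

print(probs.sum())  # tensor(1.)
print(torch.allclose(log_probs, logits - torch.logsumexp(logits, dim=-1)))  # True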

Hey @vblagoje, is it correct to use the same input ids as labels when calling the forward pass on the model, i.e. outputs = model(encoded_input_texts.input_ids, labels=encoded_input_texts.input_ids)?
I am trying to implement this for T5, where we need to provide the labels field, but I was not sure what to pass there, since we want to calculate the log probability of a single standalone sequence and there is no output per se.
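Not an authoritative answer, but the usual pattern for an encoder-decoder model like T5 is to put the sequence you want to score in labels and give the encoder whatever conditioning text you have; the per-token log probabilities can then be read off the logits. A sketch (the model name and texts are placeholders):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

enc = tokenizer(["translate English to German: I love pizza."], return_tensors="pt")
labels = tokenizer(["Ich liebe Pizza."], return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)

# T5 builds decoder_input_ids by shifting the labels internally, so the logits at
# position i predict labels[:, i]; gather the log probability of each label token
log_probs = torch.log_softmax(out.logits, dim=-1)
token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
print(token_log_probs.sum())  # log P(labels | encoder input)

Passing the same input_ids as both input and labels would score the text conditioned on a copy of itself, which is not quite an unconditional probability of the sequence, so whether that is what you want depends on the use case.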

Hey @joaogante and @vblagoje, thanks for sharing your implementation to this problem. This was exactly what I was looking for!

Looking at your output, I was wondering why only the first token of the last sentence gets skipped, and not the first token of every sentence?

How do I add the additional padding to get the logits for the first token of each sentence in the list? For my application, I want to compute the probability scores for each token in a list of sentences. However, the probabilities of the tokens differ slightly depending on whether the first token was taken into account or not.

Additionally, do I understand this correctly that the probability for the first token corresponds to the unconditional probability of that token (i.e., based on its frequency)?

Thank you for your time and help on this matter.


@joaogante Thank you so much for the reply. Your responses in this thread are very helpful. I have a problem that I have been dealing with for months, and I would really appreciate some help.
I am using the model for classification. I have three classes, so it is a 3-class classification problem. The class labels are: not vivid, moderately vivid, highly vivid. The model predicts the class labels, but I need to get **the probability of each class**, similar to a BERT model. If I fine-tune a BERT model, it is easy to get the probability of each class: we add a softmax layer on top of the last layer, which turns the per-class logits into probabilities. But the performance of BERT is not good for my scenario, while a generative model like T5 or Flan performs well. However, I don’t know how to get the probability of each class with these generative models, which output a probability distribution over the vocab, not over the classes.
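Not from the thread, but one common workaround (a sketch; the model name, prompt and label strings are placeholders) is to score each verbalized class label with the generative model and renormalize over the three candidates:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "How vivid is the following description? ..."
class_labels = ["not vivid", "moderately vivid", "highly vivid"]

enc = tokenizer([prompt], return_tensors="pt")
label_log_probs = []
for label in class_labels:
    target = tokenizer([label], return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=target).logits
    # sum of per-token log probabilities = log P(label text | prompt)
    log_probs = torch.log_softmax(logits, dim=-1)
    label_log_probs.append(log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1).sum())

# Renormalize over the three candidates to get "class probabilities"
class_probs = torch.softmax(torch.stack(label_log_probs), dim=0)
print(dict(zip(class_labels, class_probs.tolist())))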

@joaogante

Hi, sir.

Could you please clarify one thing?

In the documentation of compute_transition_scores, it says that this function returns:

A torch.Tensor of shape (batch_size*num_return_sequences, sequence_length) containing the transition scores (logits)

But in the example below:

for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | log probability | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")
|   262 |  the     | -1.414 | 24.33%
|  1110 |  day     | -2.609 | 7.36%
|   618 |  when    | -2.010 | 13.40%
|   356 |  we      | -1.859 | 15.58%
|   460 |  can     | -2.508 | 8.14%

The example prints the log prob and prob for each token.
So I am confused by the terminology here because, as per my understanding, a logit is not a log prob.
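For what it’s worth, the wording seems to come from the normalize_logits argument of compute_transition_scores: with normalize_logits=False (the default) the returned values are the unnormalized, processed logits of the selected tokens, while with normalize_logits=True they are log-softmax values, i.e. log probabilities, which is what the percentages in the example correspond to. A minimal sketch (greedy decoding with GPT-2, just to show the flag):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer(["Today is"], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5,
                         return_dict_in_generate=True, output_scores=True)

# Unnormalized (processed) logits of the selected tokens
raw = model.compute_transition_scores(outputs.sequences, outputs.scores,
                                      normalize_logits=False)
# Log probabilities of the selected tokens (log_softmax over the vocab first)
log_probs = model.compute_transition_scores(outputs.sequences, outputs.scores,
                                            normalize_logits=True)
print(raw[0], log_probs[0], log_probs[0].exp())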

Thank you so much for the implementation @joaogante !!

@Sboeve, did you find out why only the first token of the last sentence gets skipped, and not the first token of every sentence?

Yes, the final sentence in the list has the most tokens of the three. The first two sentences thus get padded to reach the same length as the final sentence. Since there is no additional padding to the left of the final sentence, its first token gets dropped.

Adding a bos token to the inputs will ensure that the probability of the first token of the longest sentence also gets returned.
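To make that concrete, here is a small sketch (assuming a GPT-2 style tokenizer, where the BOS token has to be prepended by hand):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids
# Prepend BOS so that the first real token also gets a prediction
# (its logits come from the position that attends only to BOS)
bos = torch.tensor([[tokenizer.bos_token_id]])
input_ids = torch.cat([bos, input_ids], dim=1)

with torch.no_grad():
    logits = model(input_ids).logits

# Log probability of each real token, including the very first one
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_log_probs = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print(token_log_probs)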

@amrtanair I hope this helps!