Generation Probabilities: How to compute probabilities of output scores for GPT2

It has to be used with generate(), not with forward().

No, I don't think so, sadly. Such a feature would be very hard to implement, though.

@patrickvonplaten, do return_dict_in_generate and output_scores work only with do_sample=True, or can I also use them with beam_search, top_k and top_p?


They can be used with all generate methods, including beam_search.

I asked a question regarding the shape of the scores returned by the generate() function. Why is the length of output_scores always 1 longer than the max_length in the output of generate()?


Can we take gradients with respect to these generated logits?


Just wanted to link the thread "Big generate() refactor" (Transformers category, Hugging Face Forums) for people who are looking into this in the future! I recently needed gradients computed with respect to the logits but was unable to get them until I found that thread.

The discussion "Question about greedy_search" (Transformers category, Hugging Face Forums) was also useful and provided a more concrete example of the above.

Thank you.

I am trying to apply the probability generation to GPT-J, but model.generate() returns a torch.Tensor, so there is no generated_outputs.scores attribute.

Any ideas for a solution?


@patrickvonplaten is it not the case that the history for a beam element i at time t-1 will not generally be a prefix of the history of the element i at time t (because at each time step we sort the elements of the beam transformers/ at d83b0e0c079f0826d186270a86622ff5f1efd9c1 · huggingface/transformers · GitHub)… and therefore the above gather operation will not actually do what is intended here?

Having seen quite a few issues now regarding the beam scores computation, I would like to clarify a bit how the scores are currently calculated for beam search and why (as noted by many of you) this is not ideal at the moment.

Let’s assume we are running the following beam search example:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

# return_dict_in_generate=True makes generate() return an output object instead of a plain tensor
generated_outputs = gpt2.generate(input_ids, num_beams=2, num_return_sequences=2, output_scores=True, return_dict_in_generate=True, length_penalty=0)

Note that length_penalty is set to 0 here, which simplifies the computation afterward: by default length_penalty == 1, in which case the sequences_scores are divided by the generated length of the sequences.
Now we can see that generated_outputs has the following keys:

['sequences', 'sequences_scores', 'scores']

sequences - are just the token id sequences of the 2 most probable beams.

sequences_scores - are the cumulative log probabilities of the two most probable beams. It can be formulated as a recursive formula: sequences_scores[k]_i = (sequences_scores[k]_{i-1} + log_probs[i-1, :]).topk(2)[k] with sequences_scores[k]_{i=start_token} = 0 (i being the time step).

scores - now this is where it becomes confusing and where we should probably change the API. At the moment, scores[i] holds one row per beam, and its entry for beam b and token j is defined as log_probs[i-1][b, j] + sequences_scores[b]_{i-1}. When scores[i] is flattened into a single vector, a flat index j corresponds to beam index j // vocab_size and token id j % vocab_size.
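
The definitions above can be traced on a toy example. This is a minimal sketch in plain Python, not the library's implementation: the beam scores, the vocabulary of 3 tokens, and the per-beam log-probabilities are made-up numbers, and beam_step is a hypothetical helper. It builds the score matrix for one step, flattens it, keeps the best num_beams entries, and decodes each flat index into a beam index and a token id.

```python
import math


def beam_step(beam_scores, log_probs, num_beams):
    """One toy beam-search step (hypothetical helper, for illustration only)."""
    vocab_size = len(log_probs[0])
    # Score of extending beam b with token t:
    # cumulative score of beam b + log-probability of token t given beam b
    scores = [
        [beam_scores[b] + lp for lp in log_probs[b]]
        for b in range(len(beam_scores))
    ]
    # Flatten the (num_beams, vocab_size) matrix and keep the best entries
    flat = [s for row in scores for s in row]
    top = sorted(range(len(flat)), key=lambda j: flat[j], reverse=True)[:num_beams]
    # Flat index j -> (beam index, token id, new cumulative score)
    return [(j // vocab_size, j % vocab_size, flat[j]) for j in top]


# Made-up numbers: 2 beams, vocabulary of 3 tokens
beam_scores = [-0.5, -1.2]
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],  # next-token distribution for beam 0
    [math.log(0.1), math.log(0.1), math.log(0.8)],  # next-token distribution for beam 1
]

candidates = beam_step(beam_scores, log_probs, num_beams=2)
for beam, token, score in candidates:
    print(f"beam={beam} token={token} cumulative_log_prob={score:.4f}")
```

The third element of each candidate plays the role of sequences_scores at the next step when length_penalty is 0, which is the recursion described above.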

Now a couple of things can be done to verify this:


yields the same result as


Also the correct sequence id can be retrieved from the scores:


corresponds to the last generated tokens

generated_outputs.sequences[:, -1]

The problem now is that the scores don’t really give any information about the probability of token j at time i which is what most people seem to be interested in.

So, I’ll start a PR to change this behavior.


Hi @patrickvonplaten, I am wondering if there is any update on this PR?


Hi @patrickvonplaten, I second @shuyanzh’s request! I have been trying to get the individual logprobs for the tokens that are in the best hypothesis, but it’s not very clear how one can do that (even when the length penalty is 0).


I will both provide some explanation and answer a question on this topic. To my knowledge, when using beam search to generate text, each element of the tuple generated_outputs.scores contains a matrix in which each row corresponds to one beam stored at this step, while the values are the sum of the log-probabilities of the previous sequence and the next token. Consequently, here is a snippet of code which would theoretically allow getting all the beams stored at each step (I use the bart-base checkpoint from facebook, although it should not matter):

# Initial beam, which consists of only the </s> token (id 2) with log-probability 0
beams = [[([2], 0.0)]]
# Vocabulary size = number of columns of each score matrix
vocab_size = generated.scores[0].shape[-1]
for score_matrix in generated.scores:
    # Add a list, which will contain k most probable beams on this step
    beams.append([])
    # k most probable tokens
    topk = score_matrix.view(-1).topk(4)
    # Get the sum of the log probs (score) of the beam
    topk_scores = topk.values.cpu().detach().numpy()
    # Get the indices of the most probable tokens
    topk_indices = topk.indices.cpu().detach().numpy()
    # Get the tokens ids
    topk_tokens = topk_indices % vocab_size
    # Get the indices of the beams, to which these tokens belong
    topk_beams = topk_indices // vocab_size
    # For each of k most probable tokens
    for token, nbeam, score in zip(topk_tokens, topk_beams, topk_scores):
        # Take the beam, to which the token belongs
        beam = list(beams[-2][nbeam][0])
        # Add the token to the beam
        beam.append(int(token))
        # Add the beam and its score
        beams[-1].append((beam, score))

Outputting the last element with [(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(beam_data[0])), beam_data[1]) for beam_data in beams[-1]] results in:

[('</s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>',
 ('</s><s>David Harris has called for the BBC to do more to promote books and encourage.</s>',
 ('</s><s>David Harris has called for the BBC to do more to promote books. awards to do',
 ('</s><s>David Harris has called for the BBC to do more to promote books and authors authors.',

Since I used the default length_penalty=1, to get the true score of the sequence, I do:
generated.sequences_scores.item() * (len(generated.sequences[0]) - 1). This score coincides with the score of the most probable beam among the last beams (beam 0); however, the sequences are different. The true sequence, output with tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(generated.sequences[0])), is:

</s><s>David Harris has called for the BBC to do more to promote books and authors.</s>

A closer look showed that the text with the best score (i.e. the most probable one), </s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>, coincides with the output of the model when length_penalty=0, excluding the last 3 “rubbish” tokens. Here is the output when length_penalty=0:
</s><s>David Harris has called for the BBC to give back to books.</s>

I hope I am missing something here; it still seems pretty unclear to me.
May be of interest to @patrickvonplaten

I eventually managed to solve this problem, @Aktsvigun.
I found it simpler to use length_penalty=0.
I created a tensor which keeps the scores for the beams that are being updated (similar to what you get w/ the input_ids):
running_scores = torch.cat([running_scores[beam_idx, :], beam_scores.unsqueeze(-1)], dim=-1)
Then, when we add the hypothesis of the finished beams in beam_scorer.process, I also send the running_scores. Finally, I concatenate the sumlogprobs (next_score) to the running_scores. Then, when you get the most likely hypothesis in finalize, the tuple contains the running_scores.
To get the logprobs for each token, one would just need to get the consecutive increments (negative here) in running_scores.
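
The "consecutive increments" step can be sketched as follows. This is only an illustration in plain Python: the cumulative values are made up, and token_logprobs is a hypothetical helper, not part of transformers. On a tensor, torch.diff should give the same increments.

```python
def token_logprobs(running_scores):
    """Consecutive differences of cumulative log-probabilities (hypothetical helper)."""
    previous = 0.0  # an empty hypothesis has cumulative log-probability 0
    increments = []
    for cumulative in running_scores:
        increments.append(cumulative - previous)
        previous = cumulative
    return increments


# Made-up cumulative log-probabilities of one hypothesis, one entry per generated token
running_scores = [-0.4, -1.1, -1.3, -2.0]

per_token = token_logprobs(running_scores)
for step, logprob in enumerate(per_token):
    print(f"step={step} token_log_prob={logprob:.2f}")
```

Each increment is negative (or zero) because a log-probability never exceeds 0, and the increments sum back to the last cumulative score.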

Hi @nunonmg, thanks for your solution! Still, length_penalty=0 prompts the model to generate shorter sequences, which may negatively affect their quality. Namely, in my example, length_penalty=0 results in the sequence "David Harris has called for the BBC to give back to books.", while the summary generated with length_penalty=1 is "David Harris has called for the BBC to do more to promote books and authors.", which is longer and more accurate.

Yes, one can use length_penalty=0 just for confirmation purposes. As I am using the beam_scores, these are the cumulative sums (as if length_penalty=0). The length_penalty is only used when you compute the score of the finished hypothesis. Thus, if you use the setting that I mentioned, the final beam score would be the last token score divided by the length of the hypothesis.
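
To make the role of length_penalty concrete, here is a small sketch. It assumes the final hypothesis score is the summed log-probability divided by length ** length_penalty, as described in this thread; which tokens count towards the length may differ between library versions. The numbers and helper names are made up for illustration.

```python
def final_score(sum_logprobs, length, length_penalty):
    """Length-normalised score of a finished hypothesis (assumed formula)."""
    return sum_logprobs / (length ** length_penalty)


def recover_sum(score, length, length_penalty):
    """Invert the normalisation to get the summed log-probability back."""
    return score * (length ** length_penalty)


# Made-up hypotheses: (summed log-probability, length)
short_hyp = (-4.0, 8)
long_hyp = (-6.0, 16)

for penalty in (0, 1):
    short_score = final_score(*short_hyp, penalty)
    long_score = final_score(*long_hyp, penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f} long={long_score:.3f} -> {winner}")

# Multiplying the reported score by the length restores the cumulative sum
restored = recover_sum(final_score(*long_hyp, 1), long_hyp[1], 1)
```

With length_penalty=0 the raw sum decides, so the shorter hypothesis wins here. With length_penalty=1 the per-token average decides, and the longer one wins.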


Thank you! But why are the scores the same in my example, given that they “correspond” to different texts? In addition, how can I “restore” the correct sequence having length_penalty=0?

Looks like the pull request is here: and is implemented in transformers v4.16.0

Can you please explain the scores returned by generate in detail, in particular when we use a batch_size > 1?
Why does applying argmax() to scores not give the same thing as in sequences?
With batch_size > 1, why is the shape of scores (batch_size*beam_nums, vocab_len) rather than (batch_size, beam_nums, vocab_len)? It is really confusing.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
unk_index = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
eos_index = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")


# sequences
seq1 = "summarize: I am confused! I am confused"
seq2 = "summarize: why generate does not work with batch_size >1"

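# encoding input and attention mask
# (padding and return_tensors are assumed here; they are needed to batch the two inputs)
encoding = tokenizer(
    [seq1, seq2],
    padding=True,
    return_tensors="pt",
)
model = model.to("cuda")
input_ids, attention_mask = encoding.input_ids.to("cuda"), encoding.attention_mask.to("cuda")
# Other generate() arguments of the original call (e.g. the number of beams) are not shown
output = model.generate(input_ids,
                     attention_mask=attention_mask,
                     return_dict_in_generate=True, # needed for output.sequences and output.scores
                     output_scores=True,
                     early_stopping=False) # to get len(scores) = sequences max_length
tokenizer.batch_decode(output.sequences, skip_special_tokens=True)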

# output.sequences
# tensor([[    0,    27,   183, 11319,    55,     1,     0,     0,     0,     0,
#              0,     0,     0,     0],
#         [    0,  3806,   405,    59,   161,    28, 11587,   834,  7991,  2490,
#            536,     3,     5,     1]], device='cuda:0')

# How to get the above indices using output.scores ??
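
As a rough illustration of the shape question, the rows of one step's score matrix can be regrouped per input. This sketch uses plain Python lists and made-up numbers, and it assumes the rows are laid out batch-major, i.e. row = batch_idx * num_beams + beam_idx; regroup and argmax are hypothetical helpers.

```python
def regroup(step_scores, batch_size, num_beams):
    """Split (batch_size * num_beams) rows into [batch][beam] (assumes batch-major rows)."""
    assert len(step_scores) == batch_size * num_beams
    return [step_scores[b * num_beams:(b + 1) * num_beams] for b in range(batch_size)]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda t: row[t])


# Made-up scores for one step: batch_size=2, num_beams=2, vocabulary of 3 tokens
step_scores = [
    [-1.0, -0.2, -3.0],  # batch 0, beam 0
    [-0.9, -2.5, -0.1],  # batch 0, beam 1
    [-0.3, -1.5, -2.0],  # batch 1, beam 0
    [-2.2, -0.4, -1.1],  # batch 1, beam 1
]

grouped = regroup(step_scores, batch_size=2, num_beams=2)
per_beam_argmax = [[argmax(row) for row in beams] for beams in grouped]
print(per_beam_argmax)
```

A row-wise argmax yields one token per beam, so there are num_beams candidates per input at every step. Which of them (if any) appears in output.sequences depends on the beam the final hypothesis passed through at that step, and the selected token need not be the argmax of its row. Later releases of transformers expose a beam_indices field and a compute_transition_scores method for this purpose; check the documentation of your installed version.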


Could you elaborate on how you chose as your method of obtaining an unique probability per sequence? Why not use gen_probs.mean(-1) for the average probability score per sequence?
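
For readers weighing the two options, here is a small sketch of how they differ. The values in gen_probs are made up, with one row per sequence and one entry per generated token. The product of the token probabilities is, by the chain rule, the probability of the whole sequence, while the mean is a per-token average and not a probability of the sequence.

```python
import math

# Made-up per-token probabilities: one row per generated sequence
gen_probs = [
    [0.9, 0.8, 0.7],
    [0.5, 0.5, 0.5],
]

# Joint probability of each sequence: product over its tokens
joint_probs = [math.prod(row) for row in gen_probs]

# Average per-token probability of each sequence
mean_probs = [sum(row) / len(row) for row in gen_probs]

for joint, mean in zip(joint_probs, mean_probs):
    print(f"joint={joint:.3f} mean={mean:.3f}")
```

The product shrinks as sequences get longer, so it favours short outputs when comparing sequences of different lengths. The mean does not have that bias, but it no longer corresponds to the likelihood the model assigns to the sequence.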