# Generation Probabilities: How to compute probabilities of output scores for GPT2

Now that it is possible to return the logits generated at each step, one might wonder how to compute the probabilities for each generated sequence accordingly.

The following code snippet showcases how to do so for generation with `do_sample=True` for GPT2:

``````
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

generated_outputs = gpt2.generate(input_ids, do_sample=True, num_return_sequences=3, output_scores=True)

# only use ids that were generated (drop the prompt tokens)
# gen_sequences has shape [3, 15]
gen_sequences = generated_outputs.sequences[:, input_ids.shape[-1]:]

# let's stack the logits generated at each step to a tensor and transform
# logits to probs
probs = torch.stack(generated_outputs.scores, dim=1).softmax(-1)  # -> shape [3, 15, vocab_size]

# now we need to collect the probability of the generated token
# we need to add a dummy dim in the end to make gather work
gen_probs = torch.gather(probs, 2, gen_sequences[:, :, None]).squeeze(-1)

# now we can do all kinds of things with the probs

# 1) the probs that exactly those sequences are generated again
# those are normally going to be very small
unique_prob_per_sequence = gen_probs.prod(-1)

# 2) normalize the probs over the three sequences
normed_gen_probs = gen_probs / gen_probs.sum(0)
# compare with a tolerance: exact float equality can fail due to rounding
assert torch.allclose(normed_gen_probs[:, 0].sum(), torch.tensor(1.0)), "probs should be normalized"

# 3) compare normalized probs to each other like in 1)
unique_normed_prob_per_sequence = normed_gen_probs.prod(-1)
``````

Can I use this to generate sequences only over a probability threshold?

Great to see this much-needed feature.

I want to try it out with transformers 4.2.0, but I see an error like "TypeError: forward() got an unexpected keyword argument 'return_dict_in_generate'".

It has to be used with `generate()` - not with `forward()`

No, I don't think so, sadly. Such a feature would be very hard to implement.
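
Since there seems to be no built-in option for this, one possible workaround is to sample several sequences and filter them afterwards by their total probability. Below is a minimal sketch in plain Python, assuming the per-token probabilities have already been gathered as in the `gen_probs` tensor above. The helper names and the toy numbers are made up for illustration, and the comparison is done in log space to avoid underflow for long sequences.

```python
import math

def sequence_log_prob(token_probs):
    """Sum of the log-probabilities of the generated tokens of one sequence."""
    return sum(math.log(p) for p in token_probs)

def filter_by_threshold(per_sequence_token_probs, min_prob):
    """Return indices of sequences whose total probability is >= min_prob."""
    log_threshold = math.log(min_prob)
    return [
        idx
        for idx, token_probs in enumerate(per_sequence_token_probs)
        if sequence_log_prob(token_probs) >= log_threshold
    ]

# Toy stand-in for gen_probs.tolist(): 3 sampled sequences, 4 generated tokens each.
toy_gen_probs = [
    [0.9, 0.8, 0.9, 0.7],  # product = 0.4536
    [0.2, 0.1, 0.5, 0.3],  # product = 0.003
    [0.6, 0.5, 0.4, 0.5],  # product = 0.06
]

kept = filter_by_threshold(toy_gen_probs, min_prob=0.05)
print(kept)  # indices of the sequences that pass the threshold
```

This only filters after generation; it does not steer sampling towards high-probability sequences.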

@patrickvonplaten do `return_dict_in_generate` and `output_scores` work only for `do_sample=True`, or can I use them with beam search, top-k, and top-p as well?

`return_dict_in_generate` can be used with all generate methods, including `beam_search`.

I asked a question regarding the shape of `scores` returned from the `generate()` function. Why is the length of `scores` always 1 longer than the `max_length` in the output of `generate()`?


Can we take gradients with respect to these generated logits?

Hi,

Just wanted to link the thread "Big `generate()` refactor" (Transformers, Hugging Face Forums) for people who are looking into this in the future! I recently needed gradients computed with respect to the logits but was unable to get them until I found that thread.

The discussion "Question about greedy_search" (Transformers, Hugging Face Forums) was also useful and provided a more concrete example.

Thank you.

I am trying to apply the probability computation to GPT-J, but `model.generate()` returns a `torch.Tensor`, so there is no `generated_outputs.scores` attribute.

Any ideas for a solution?

@patrickvonplaten is it not the case that the history of beam element `i` at time `t-1` will not generally be a prefix of the history of element `i` at time `t` (because at each time step we sort the elements of the beam: transformers/generation_utils.py at d83b0e0c079f0826d186270a86622ff5f1efd9c1 · huggingface/transformers · GitHub), and therefore the above gather operation will not actually do what is intended here?
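
To make the concern concrete, here is a toy illustration in plain Python. It is not the library's data structure: the `steps` trace, with a parent slot and a token per beam slot, is a hypothetical stand-in. It shows that reading the same slot at every step can give a different history than following the parent pointers back from the final slot.

```python
# Each step stores, for every beam slot, (parent_slot_at_previous_step, token).
# Hypothetical toy trace with 2 beams and 3 steps; tokens are plain strings.
steps = [
    [(0, "a"), (0, "b")],  # step 0: both slots extend the single start hypothesis
    [(1, "c"), (0, "d")],  # step 1: slot 0 now continues old slot 1 (reordered)
    [(0, "e"), (0, "f")],  # step 2: both slots continue slot 0
]

def backtrack(steps, final_slot):
    """Follow parent pointers from the last step back to the first."""
    tokens = []
    slot = final_slot
    for step in reversed(steps):
        parent, token = step[slot]
        tokens.append(token)
        slot = parent
    return tokens[::-1]

def read_row(steps, slot):
    """Naively read the same slot at every step, ignoring reordering."""
    return [step[slot][1] for step in steps]

true_history = backtrack(steps, final_slot=0)
naive_history = read_row(steps, slot=0)
print(true_history, naive_history)
```

The two lists differ in their first token, which is the mismatch described above: a per-row gather implicitly assumes the naive reading.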

Having seen quite a few issues now regarding the beam score computation, I would like to clarify a bit how the scores are currently calculated for beam search and why (as noted by many of you) this is not ideal at the moment.

Let's assume we are running the following beam search example:

``````
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

generated_outputs = gpt2.generate(input_ids, num_beams=2, num_return_sequences=2, output_scores=True, length_penalty=0)
``````

Note that `length_penalty` is set to 0 here, which simplifies the computation afterward: by default `length_penalty == 1`, so the `sequences_scores` are divided by the generated length of the sequences.
Now we can see that `generated_outputs` has the following keys:

``````
['sequences', 'sequences_scores', 'scores']
``````

`sequences` - are just the token id sequences of the 2 most probable beams.

`sequences_scores` - are the cumulative log probabilities of the two most probable beams. It can be formulated recursively: `sequence_score[k]_i = sequence_score[k]_{i-1} + topk(log_probs[i-1, :], 2)[k]` with `sequence_score[k]_{i=start_token} = 0` (`i` being the time step).

`scores` - now this is where it becomes confusing and where we should probably change the API. At the moment, `scores[i, j]` is defined as `log_probs[i-1, j] + sequence_score[j // vocab_size]_{i-1}`, where `j` runs over the flattened (beam, vocab) dimension, so `j // vocab_size` is the beam index and `j % vocab_size` is the token id.
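
A single beam search step under this definition can be sketched in plain Python as follows. This is a toy example with made-up probabilities, not the library's implementation. It assumes the scores are flattened row-major over (beam, vocab), so that `j // vocab_size` recovers the beam index and `j % vocab_size` the token id.

```python
import math

vocab_size = 4  # hypothetical tiny vocabulary

# Cumulative log-probabilities of the 2 beams after the previous step.
beam_scores = [math.log(0.6), math.log(0.3)]

# Next-token log-probabilities for each beam (each row sums to 1 in prob space).
log_probs = [
    [math.log(p) for p in (0.1, 0.2, 0.3, 0.4)],
    [math.log(p) for p in (0.7, 0.1, 0.1, 0.1)],
]

# scores[b][v] = log_probs[b][v] + beam_scores[b]
scores = [[lp + beam_scores[b] for lp in row] for b, row in enumerate(log_probs)]

# Flatten over (beam, vocab) and take the 2 best candidates.
flat = [s for row in scores for s in row]
top2 = sorted(range(len(flat)), key=lambda j: flat[j], reverse=True)[:2]

# Decode each flat index j into (beam index, token id).
decoded = [divmod(j, vocab_size) for j in top2]
print(decoded)
```

Note that the stored values mix the token's own log-probability with the beam's history, which is why they cannot be read directly as per-token probabilities.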

Now a couple of things can be done to verify this:

``````
generated_outputs.scores[-1].topk(2)
``````

yields the same result as

``````
generated_outputs.sequences_scores
``````

``````
generated_outputs.scores[-1].topk(2).indices[0]
``````

corresponds to the last generated tokens

``````
generated_outputs.sequences[:, -1]
``````

The problem now is that the scores don't really give any information about the probability of token `j` at time `i`, which is what most people seem to be interested in.

So, I'll start a PR to change this behavior.


Hi @patrickvonplaten, I am wondering if there is any update on this PR?


Hi @patrickvonplaten, I second @shuyanzh's request! I have been trying to get the individual logprobs for the tokens that are in the best hypothesis, but it's not very clear how one can do that (even when the length penalty is 0).


I will both provide some explanation and ask a question on this topic. To my knowledge, when using beam search to generate text, each element of the tuple `generated_outputs.scores` is a matrix in which each row corresponds to one beam stored at that step, and the values are the sum of the log-probabilities of the previous sequence and the next token. Consequently, here is a snippet of code which should, in theory, recover all the beams stored at each step (I use the bart-base checkpoint from facebook, although it should not matter):

``````
# Initial beam, which consists of only the </s> token with score 1
beams = [[([2], 1)]]
for score_matrix in generated.scores:
    # Add a list, which will contain the k most probable beams at this step
    beams.append([])
    # k most probable tokens
    topk = score_matrix.view(-1).topk(4)
    # Get the sum of the log probs (score) of the beam
    topk_scores = topk.values.cpu().detach().numpy()
    # Get the indices of the most probable tokens
    topk_indices = topk.indices.cpu().detach().numpy()
    # Get the token ids
    topk_tokens = topk_indices % generated.scores[1].shape[1]
    # Get the indices of the beams to which these tokens belong
    topk_beams = topk_indices // generated.scores[1].shape[1]
    # For each of the k most probable tokens
    for token, nbeam, score in zip(topk_tokens, topk_beams, topk_scores):
        # Take the beam to which the token belongs
        beam = list(beams[-2][nbeam][0])
        # Add the token to the beam
        beam.append(token)
        # Add the beam and its score
        beams[-1].append((beam, score))
``````

Outputting the last element with `[(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(beam_data[0])), beam_data[1]) for beam_data in beams[-1]]` results in:

``````
[('</s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>',
  -8.348045),
 ('</s><s>David Harris has called for the BBC to do more to promote books and encourage.</s>',
  -9.903482),
 ('</s><s>David Harris has called for the BBC to do more to promote books. awards to do',
  -10.104368),
 ('</s><s>David Harris has called for the BBC to do more to promote books and authors authors.',
  -10.360886)]
``````

Since I used the default `length_penalty=1`, to get the true score of the sequence, I compute
`generated.sequences_scores.item() * (len(generated.sequences[0]) - 1)`. This score coincides with the score of the most probable beam among the last beams (beam 0); however, the sequences are different. The actual output sequence, obtained with `tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(generated.sequences[0]))`, is:

` </s><s>David Harris has called for the BBC to do more to promote books and authors.</s>`

A closer look showed that the text with the highest score (i.e. the most probable one), `</s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>`, coincides with the output of the model when `length_penalty=0`, excluding the last 3 "rubbish" tokens. Here is the output when `length_penalty=0`:
`</s><s>David Harris has called for the BBC to give back to books.</s>`

I hope I am missing something here; it still seems pretty unclear to me.
May be of interest to @patrickvonplaten

I eventually managed to solve this problem, @Aktsvigun.
I found it simpler to use `length_penalty=0`.
I created a tensor which keeps the scores for the beams that are being updated (similar to what you get with the `input_ids`):
`running_scores = torch.cat([running_scores[beam_idx, :], beam_scores.unsqueeze(-1)], dim=-1)`
Then, when we add the hypothesis of the finished beams in `beam_scorer.process`, I also send the `running_scores`. Finally, I concatenate the sumlogprobs (`next_score`) to the `running_scores`. Then, when you get the most likely hypothesis in `finalize`, the tuple contains the `running_scores`.
To get the logprobs for each token, one would just need to get the consecutive increments (negative here) in `running_scores`.
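
The last step can be sketched in plain Python as follows. It assumes `running_scores` holds the cumulative log-probability after each generated token, starting from a score of 0 before the first token. The function name and the toy numbers are made up for illustration.

```python
def per_token_logprobs(running_scores):
    """Differences of consecutive cumulative log-probabilities."""
    previous = 0.0  # cumulative score before any token is generated
    increments = []
    for cumulative in running_scores:
        increments.append(cumulative - previous)
        previous = cumulative
    return increments

# Toy cumulative scores for one finished hypothesis with 3 generated tokens.
toy_running_scores = [-0.5, -1.25, -3.25]
token_logprobs = per_token_logprobs(toy_running_scores)
print(token_logprobs)
```

Each increment is negative (or zero), and the increments sum back to the final cumulative score.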

Hi @nunonmg, thanks for your solution! Still, `length_penalty=0` prompts the model to generate shorter sequences, which may negatively affect their quality. Namely, in my example, `length_penalty=0` results in the sequence `David Harris has called for the BBC to give back to books.`, while the summary generated with `length_penalty=1` is `David Harris has called for the BBC to do more to promote books and authors.`, which is longer and more accurate.

Yes, one can use `length_penalty=0` just for confirmation purposes. As I am using the `beam_scores`, these are the cumulative sums (as if `length_penalty=0`). The `length_penalty` is only used when you compute the score of the finished hypothesis. Thus, if you use the setting that I mentioned, the final beam score would be the last token score divided by the length of the hypothesis.
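
The relation between the cumulative score and the final, length-normalized score can be sketched as follows. This is a minimal sketch, assuming the normalization is a division by `length ** length_penalty`. Which tokens count towards `length` (for example, whether a start token is included) is an assumption here and may differ between library versions, so it is worth checking against your own outputs.

```python
def final_beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized score of a finished hypothesis."""
    return sum_logprobs / (length ** length_penalty)

def recover_sum_logprobs(score, length, length_penalty=1.0):
    """Invert the normalization to get the cumulative log-probability back."""
    return score * (length ** length_penalty)

# Toy numbers: a hypothesis of 16 tokens with cumulative log-probability -8.0.
normalized = final_beam_score(-8.0, length=16)                       # length_penalty=1
unnormalized = final_beam_score(-8.0, length=16, length_penalty=0)   # no normalization
restored = recover_sum_logprobs(normalized, length=16)
print(normalized, unnormalized, restored)
```

With `length_penalty=0` the divisor is 1, so the final score equals the cumulative sum, which is why that setting is convenient for checking the numbers.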


Thank you! But why are the scores the same in my example, given that they "correspond" to different texts? In addition, how can I "restore" the correct sequence having `length_penalty=0`?