It has to be used with `generate()`, not with `forward()`.

No, I don't think so, sadly. Such a feature would be very hard to implement, though.

@patrickvonplaten do `return_dict_in_generate` and `output_scores` work only with `do_sample=True`, or can I also use them with `beam_search`, `top_k`, and `top_p`?

`return_dict_in_generate` can be used with all generate methods, including `beam_search`.

I asked a question regarding the shape of `scores` returned from the `generate()` function. Why is the length of the `output_scores` always +1 longer than the `max_length` in the output of `generate()`?

Can we take gradients with respect to these generated logits?

Hi,

Just wanted to link the thread "Big `generate()` refactor" (Transformers - Hugging Face Forums) for people who look into this in the future! I recently needed gradients computed with respect to the logits but was unable to get them until I found that thread.

The discussion "Question about greedy_search" (Transformers - Hugging Face Forums) was also useful and provided a more concrete example of the above.

Thank you.

I am trying to get generation probabilities for GPT-J, but `model.generate()` returns a `torch.Tensor`, so there is no `generated_outputs.scores` attribute.

Any ideas for a solution?

@patrickvonplaten is it not the case that the history of beam element `i` at time `t-1` will not generally be a prefix of the history of element `i` at time `t` (because at each time step we sort the elements of the beam: transformers/generation_utils.py at d83b0e0c079f0826d186270a86622ff5f1efd9c1 · huggingface/transformers · GitHub)... and therefore the above gather operation will not actually do what is intended here?
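To make the concern concrete, here is a toy sketch in plain Python with made-up data (it is not library code, and the names `steps`, `backtrack`, and `naive_row_read` are hypothetical). It assumes each step records, per beam slot, the chosen token and the slot of the parent beam it extends. Because slots are re-sorted after every step, reading a fixed row across steps can differ from the true history of that row.

```python
# Toy beam-search trace with 2 beams (made-up data, not library output).
# Each step records, for every beam slot, the token chosen and the slot
# of the parent beam it extends (slots are re-sorted after every step).
steps = [
    {"tokens": ["a", "b"], "parents": [0, 0]},  # both extend the start beam
    {"tokens": ["c", "d"], "parents": [1, 1]},  # both survivors extend "b"
    {"tokens": ["e", "f"], "parents": [1, 0]},  # slot 0 extends old slot 1
]


def backtrack(steps, slot):
    """Follow parent pointers from the last step back to the first."""
    tokens = []
    for step in reversed(steps):
        tokens.append(step["tokens"][slot])
        slot = step["parents"][slot]
    return tokens[::-1]


def naive_row_read(steps, slot):
    """Read the same slot at every step, ignoring the re-sorting."""
    return [step["tokens"][slot] for step in steps]


print(backtrack(steps, 0))       # true history of final slot 0
print(naive_row_read(steps, 0))  # what a fixed-row read would give
```

In this toy trace the two reads disagree, which is the situation the question describes: a gather along a fixed row only works if the parent index is tracked at every step.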

Having seen quite a few issues now regarding the beam score computation, I would like to clarify how the scores are currently calculated for beam search and why (as noted by many of you) this is not ideal at the moment.

Let's assume we are running the following beam search example:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

# Beam search with 2 beams; length_penalty=0 turns off length normalization
# of the final sequence scores.
generated_outputs = gpt2.generate(
    input_ids,
    num_beams=2,
    num_return_sequences=2,
    output_scores=True,
    return_dict_in_generate=True,
    length_penalty=0,
)
```

**Note** that `length_penalty` is set to 0 here, which simplifies the computation afterward: by default `length_penalty == 1`, so the `sequences_scores` are divided by the generated length of the sequences.

Now we can see that `generated_outputs` has the following keys:

```
['sequences', 'sequences_scores', 'scores']
```

`sequences` - are just the token id sequences of the 2 most probable beams.

`sequences_scores` - are the cumulative log probabilities of the two most probable beams. This can be written as a recursive formula: `sequence_score[k]_i = sequence_score[k]_{i-1} + (log_probs[i-1, :])_topk(2)[k]` with `sequence_score[k]_{i=start_token} = 0` (`i` being the time step).

`scores` - now this is where it becomes confusing and where we should probably change the API. At the moment, `scores[i, j]` is defined as `log_probs[i-1, j] + sequence_score[j // vocab_size]_{i-1}`, where `j` runs over the flattened `num_beams * vocab_size` candidates, so `j // vocab_size` essentially defines the beam index and `j % vocab_size` the token id.
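As a rough illustration of that definition, here is a toy single beam-search step over plain Python lists. This is only a sketch with made-up probabilities, not the library's implementation, and `beam_step` is a hypothetical helper. It adds each beam's cumulative score to its next-token log-probs, takes the top candidates over the flattened list, and decodes each flat index into a beam index and a token id.

```python
import math


def beam_step(beam_scores, log_probs, num_beams):
    """One toy beam-search step over plain Python lists.

    beam_scores: cumulative log-prob of each live beam
    log_probs:   log_probs[b][v] = log P(token v | beam b)
    Returns (score, beam_index, token_id) for the num_beams best candidates.
    """
    vocab_size = len(log_probs[0])
    # Candidate score = cumulative beam score + next-token log-prob.
    flat = [beam_scores[b] + log_probs[b][v]
            for b in range(len(beam_scores))
            for v in range(vocab_size)]
    # Top-k over the flattened (num_beams * vocab_size) candidates.
    best = sorted(range(len(flat)), key=lambda j: flat[j], reverse=True)[:num_beams]
    # Flat index j decodes as beam = j // vocab_size, token = j % vocab_size.
    return [(flat[j], *divmod(j, vocab_size)) for j in best]


beam_scores = [math.log(0.6), math.log(0.4)]
log_probs = [
    [math.log(0.5), math.log(0.3), math.log(0.2)],  # beam 0
    [math.log(0.1), math.log(0.1), math.log(0.8)],  # beam 1
]
for score, beam, token in beam_step(beam_scores, log_probs, num_beams=2):
    print(f"beam={beam} token={token} prob={math.exp(score):.2f}")
```

Note that the values being ranked are cumulative scores, not the probability of a single token at a single step.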

Now a couple of things can be done to verify this:

```
generated_outputs.scores[-1].topk(2)
```

yields the same result as

```
generated_outputs.sequences_scores
```

Also the correct sequence id can be retrieved from the scores:

```
generated_outputs.scores[-1].topk(2).indices[0]
```

corresponds to the last generated tokens

```
generated_outputs.sequences[:, -1]
```

The problem now is that the scores don't really give any information about the probability of token `j` at time `i`, which is what most people seem to be interested in.

So, I'll start a PR to change this behavior.

Hi @patrickvonplaten, I second @shuyanzh's request! I have been trying to get the individual logprobs for the tokens that are in the best hypothesis, but it's not very clear how one can do that (even when the length penalty is 0).

I will both provide some explanation and answer a question on this topic. To my knowledge, when using beam search to generate text, each element of the tuple `generated_outputs.scores` contains a matrix in which each row corresponds to one beam stored at this step, and the values are the sum of the log-probabilities of the previous sequence and the next token. Consequently, here is a snippet of code which would theoretically allow getting all the beams stored at each step (I use the bart-base checkpoint from facebook, although it should not matter):

```
# Initial beam, which consists of only the </s> token with score 1
beams = [[([2], 1)]]
for score_matrix in generated.scores:
    # Add a list, which will contain the k most probable beams at this step
    beams.append([])
    # k most probable tokens
    topk = score_matrix.view(-1).topk(4)
    # Get the sum of the log probs (score) of the beam
    topk_scores = topk.values.cpu().detach().numpy()
    # Get the indices of the most probable tokens
    topk_indices = topk.indices.cpu().detach().numpy()
    # Get the token ids
    topk_tokens = topk_indices % generated.scores[1].shape[1]
    # Get the indices of the beams to which these tokens belong
    topk_beams = topk_indices // generated.scores[1].shape[1]
    # For each of the k most probable tokens
    for token, nbeam, score in zip(topk_tokens, topk_beams, topk_scores):
        # Take the beam to which the token belongs
        beam = list(beams[-2][nbeam][0])
        # Add the token to the beam
        beam.append(token)
        # Add the beam and its score
        beams[-1].append((beam, score))
```

Outputting the last element with `[(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(beam_data[0])), beam_data[1]) for beam_data in beams[-1]]` results in:

```
[('</s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>',
-8.348045),
('</s><s>David Harris has called for the BBC to do more to promote books and encourage.</s>',
-9.903482),
('</s><s>David Harris has called for the BBC to do more to promote books. awards to do',
-10.104368),
('</s><s>David Harris has called for the BBC to do more to promote books and authors authors.',
-10.360886)]
```

Since I used the default `length_penalty=1`, to get the true score of the sequence, I do: `generated.sequences_scores.item() * (len(generated.sequences[0]) - 1)`. This score coincides with the score of the most probable beam among the last beams (beam 0); however, the sequences are different. The true sequence, output with `tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(generated.sequences[0]))`, is:

`</s><s>David Harris has called for the BBC to do more to promote books and authors.</s>`

A closer look showed that the text with the best score (i.e. the most probable), `</s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>`, coincides with the output of the model when `length_penalty=0`, excluding the last 3 "rubbish" tokens. Here is the output when `length_penalty=0`:

`</s><s>David Harris has called for the BBC to give back to books.</s>`

I hope I am missing something here; still, it seems pretty vague to me.

May be of interest to @patrickvonplaten

I eventually managed to solve this problem, @Aktsvigun.

I found it simpler to use `length_penalty=0`.

I created a tensor which keeps the scores for the beams that are being updated (similar to what you get with the `input_ids`):

`running_scores = torch.cat([running_scores[beam_idx, :], beam_scores.unsqueeze(-1)], dim=-1)`

Then, when we add the hypothesis of the finished beams in `beam_scorer.process`, I also send the `running_scores`. Finally, I concatenate the sumlogprobs (`next_score`) to the `running_scores`. Then, when you get the most likely hypothesis in `finalize`, the tuple contains the `running_scores`.

To get the logprobs for each token, one would just need to get the consecutive increments (negative here) in `running_scores`.
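The last step can be sketched in plain Python. This is a minimal illustration with made-up numbers, assuming `running_scores` holds the cumulative log-probability after each generated token for one hypothesis; `token_logprobs` is a hypothetical helper, not part of the library.

```python
def token_logprobs(running_scores):
    """Per-token log-probs as consecutive differences of cumulative scores."""
    previous = 0.0  # cumulative score before any token is generated
    out = []
    for cumulative in running_scores:
        out.append(cumulative - previous)
        previous = cumulative
    return out


running_scores = [-0.5, -1.25, -1.5, -3.0]  # made-up cumulative log-probs
print(token_logprobs(running_scores))
```

Summing the recovered per-token values gives back the last cumulative score, which is a quick sanity check on real outputs.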

Hi @nunonmg, thanks for your solution! Still, `length_penalty=0` prompts the model to generate shorter sequences, which may negatively affect their quality. Namely, in my example, `length_penalty=0` results in the sequence `David Harris has called for the BBC to give back to books.`, while the summary generated with `length_penalty=1` is `David Harris has called for the BBC to do more to promote books and authors.`, which is longer and more accurate.

Yes, one can use `length_penalty=0` just for confirmation purposes. As I am using the `beam_scores`, these are the cumulative sums (as if `length_penalty=0`). The `length_penalty` is only used when you compute the score of the finished hypothesis. Thus, if you use the setting that I mentioned, the final beam score would be the last token score divided by the length of the hypothesis.
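A small sketch of that last point, with made-up numbers. It assumes the finished-hypothesis score is the cumulative log-probability divided by `length ** length_penalty`, as described in this thread; which length is used exactly may differ between library versions, and `final_score` is a hypothetical helper.

```python
def final_score(sum_logprobs, length, length_penalty):
    """Finished-hypothesis score: cumulative log-prob / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


# Made-up hypotheses: (name, cumulative log-prob, generated length)
hypotheses = [("short", -8.0, 10), ("long", -9.0, 15)]

for penalty in (0.0, 1.0):
    best = max(hypotheses, key=lambda h: final_score(h[1], h[2], penalty))
    print(f"length_penalty={penalty}: best={best[0]}")
```

With `length_penalty=0` the raw sums are compared, so the shorter hypothesis wins here; with `length_penalty=1` the per-token average is compared and the longer one wins, which matches the behavior reported above.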

Thank you! But why are the scores the same in my example, given that they "correspond" to different texts? In addition, how can I "restore" the correct sequence having `length_penalty=0`?

Looks like the pull request is here: https://github.com/huggingface/transformers/pull/14654 and is implemented in transformers v4.16.0

Can you please explain the scores returned by `generate` in detail, in particular when we use `batch_size > 1`?

Why does applying `argmax()` on `scores` not give the same thing as in `sequences`?

With `batch_size > 1`, why is the scores shape **(batch_size*beam_nums, vocab_len)** instead of **(batch_size, beam_nums, vocab_len)**? It is really confusing.

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
unk_index = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
eos_index = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))
model.to("cuda")

# sequences
seq1 = "summarize: I am confused! I am confused"
seq2 = "summarize: why generate does not work with batch_size >1"

# encoding input and attention mask
encoding = tokenizer(
    [seq1, seq2],
    padding="longest",
    max_length=128,
    truncation=True,
    return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids.to("cuda"), encoding.attention_mask.to("cuda")

output = model.generate(
    input_ids,
    max_length=64,
    early_stopping=False,  # to get len(scores) = sequences max_length
    num_beams=4,
    do_sample=False,
    output_scores=True,
    no_repeat_ngram_size=4,
    return_dict_in_generate=True,
    num_return_sequences=1,
)
tokenizer.batch_decode(output.sequences, skip_special_tokens=True)

# output.sequences
output.sequences
# tensor([[ 0, 27, 183, 11319, 55, 1, 0, 0, 0, 0,
# 0, 0, 0, 0],
# [ 0, 3806, 405, 59, 161, 28, 11587, 834, 7991, 2490,
# 536, 3, 5, 1]], device='cuda:0')
# How to get the above indices using output.scores ??
```
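On the shape question, here is a sketch in plain Python of how the flat first dimension could be regrouped. It assumes the rows are ordered batch-major, i.e. row `r` belongs to batch item `r // num_beams` and beam slot `r % num_beams`; that ordering is my understanding of how `generate()` expands the inputs, so please verify it on your version. `split_rows` is a hypothetical helper and the data is a stand-in for one step of `scores`.

```python
def split_rows(rows, batch_size, num_beams):
    """Regroup a flat list of (batch_size * num_beams) rows into
    grouped[batch][beam], assuming batch-major ordering."""
    assert len(rows) == batch_size * num_beams
    return [rows[b * num_beams:(b + 1) * num_beams] for b in range(batch_size)]


# Stand-in for one step of scores: 2 batch items x 3 beams, vocab of 2.
flat_rows = [[float(r), float(r) + 0.5] for r in range(6)]
grouped = split_rows(flat_rows, batch_size=2, num_beams=3)
print(grouped[1][0])  # batch item 1, beam slot 0 (flat row 3)
```

Under the same assumption, `scores[t].view(batch_size, num_beams, -1)` would do this regrouping on a tensor. Even then, an `argmax` over one row is not expected to reproduce `sequences`, since beam slots are re-sorted between steps.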

Could you elaborate on how you chose `gen_probs.prod(-1)` as your method of obtaining a unique probability per sequence? Why not use `gen_probs.mean(-1)` for the average probability score per sequence?
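One way to look at the difference, as a plain-Python sketch with made-up per-token probabilities (`sequence_stats` is a hypothetical helper): by the chain rule, the product of the per-token conditional probabilities is the probability of the whole sequence, which is presumably why `prod(-1)` was used. The arithmetic mean is not a probability of the sequence; a geometric mean (the exponential of the mean log-prob) is the usual length-normalized alternative.

```python
import math


def sequence_stats(token_probs):
    """Compare three ways of collapsing per-token probabilities."""
    # Joint probability of the whole sequence (chain rule).
    joint = math.prod(token_probs)
    # Plain average of the per-token probabilities.
    arithmetic_mean = sum(token_probs) / len(token_probs)
    # Geometric mean = exp(mean log-prob), a length-normalized variant.
    mean_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
    geometric_mean = math.exp(mean_logprob)
    return joint, arithmetic_mean, geometric_mean


joint, mean, geo = sequence_stats([0.5, 0.5, 0.25])
print(joint, mean, geo)
```

The product shrinks with every extra token, so it favors short sequences; the geometric mean removes that length effect while staying consistent with the joint probability.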