@patrickvonplaten do return_dict_in_generate and output_scores work only for do_sample=True, or can I use them with beam_search and top_k/top_p as well?
return_dict_in_generate can be used with all generate methods, including beam_search.
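For illustration, a minimal sketch (gpt2 and the prompt are just placeholders) showing the two flags together with beam search and with top-k/top-p sampling:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

# Beam search with a dict-like output that carries the scores
beam_out = model.generate(input_ids, num_beams=4, max_length=20,
                          return_dict_in_generate=True, output_scores=True)

# Top-k / top-p sampling with the same two flags
sample_out = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95,
                            max_length=20, return_dict_in_generate=True,
                            output_scores=True)

print(beam_out.keys())    # e.g. sequences, sequences_scores, scores
print(sample_out.keys())  # e.g. sequences, scores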
I asked a question regarding the shape of the scores returned from the generate() function. Why is the length of output_scores always one longer than the max_length in the output of generate()?
Can we take gradients with respect to these generated logits?
Hi,
Just wanted to link the thread Big generate() refactor - Transformers - Hugging Face Forums for people who are looking into this in the future! I recently required gradients computed with respect to the logits but was unable to do so until I found the above link.
This discussion, Question about greedy_search - Transformers - Hugging Face Forums, was also useful and provided a more concrete example of the above.
Thank you.
I am trying to apply the probability computation to GPT-J, but the model.generate() function returns a torch.Tensor, meaning there is no generated_outputs.scores attribute. Any ideas for a solution?
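In case it helps: generate() only returns a dict-like output with a .scores attribute when asked to; otherwise it returns a plain tensor. A minimal sketch, using gpt2 here just to keep it small (the same flags should apply to a GPT-J checkpoint such as EleutherAI/gpt-j-6B):
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids

generated_outputs = model.generate(
    input_ids,
    max_length=20,
    return_dict_in_generate=True,  # without this, generate() returns a plain tensor
    output_scores=True,            # without this, .scores is None
)
print(type(generated_outputs))        # a ModelOutput subclass, not a torch.Tensor
print(len(generated_outputs.scores))  # one score tensor per generation step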
@patrickvonplaten is it not the case that the history for a beam element i at time t-1 will not generally be a prefix of the history of the element i at time t (because at each time step we sort the elements of the beam: transformers/generation_utils.py at d83b0e0c079f0826d186270a86622ff5f1efd9c1 · huggingface/transformers · GitHub)… and therefore the above gather operation will not actually do what is intended here?
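For readers unfamiliar with the mechanics being referenced, here is a purely illustrative toy of the beam reordering in question (made-up tensors, not the library code): after candidates are sorted, surviving beams are re-gathered along a beam index, so row i at step t need not continue row i from step t-1.
import torch

# Two beams with two generated tokens each at step t-1
prev_histories = torch.tensor([[11, 12],
                               [21, 22]])
# After sorting the candidates, both surviving beams happen to extend old beam 1
beam_idx = torch.tensor([1, 1])
new_tokens = torch.tensor([[31], [32]])
# Histories are re-gathered along beam_idx before appending the new tokens
new_histories = torch.cat([prev_histories[beam_idx], new_tokens], dim=-1)
print(new_histories)  # tensor([[21, 22, 31], [21, 22, 32]])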
Having seen quite a few issues now regarding the beam scores computation, I would like to clarify a bit how the scores are currently calculated for beam search and why (as noted by many of you) this is not ideal at the moment.
Let's assume we are running the following beam search example:
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("Today is a nice day", return_tensors="pt").input_ids
generated_outputs = gpt2.generate(input_ids, num_beams=2, num_return_sequences=2, output_scores=True, length_penalty=0)
Note that length_penalty is set to 0 here, which simplifies the computation afterward: by default (length_penalty == 1), the sequence_scores are divided by the generated length of the sequences.
Now we can see that generated_outputs has the following keys: ['sequences', 'sequences_scores', 'scores']
sequences - the token id sequences of the 2 most probable beams.
sequences_scores - the cumulative log probabilities of the 2 most probable beams. This can be formulated as a recursion: sequence_scores[k]_i = sequence_scores[k]_{i-1} + log_probs[i-1, :].topk(2)[k], with sequence_scores[k]_{i=start_token} = 0 (i being the time step).
scores - now this is where it becomes confusing and where we should probably change the API. At the moment, scores[i, j] is defined as log_probs[i-1, j] + sequence_scores[j // vocab_size]_{i-1}, where j // vocab_size essentially defines the beam index (and j % vocab_size the token id).
Now a couple of things can be done to verify this: generated_outputs.scores[-1].topk(2) yields the same result as generated_outputs.sequences_scores. Also, the correct sequence ids can be retrieved from the scores: generated_outputs.scores[-1].topk(2).indices[0] corresponds to the last generated tokens, generated_outputs.sequences[:, -1].
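A minimal sketch of these checks, written against the flattened view so that the beam/token decomposition from above applies (assuming the pre-v4.16 behaviour where each element of scores is a (num_beams, vocab_size) tensor of cumulative scores, as in the gpt2 example earlier):
last_step = generated_outputs.scores[-1]   # shape: (num_beams, vocab_size)
vocab_size = last_step.shape[-1]

top2 = last_step.view(-1).topk(2)          # top 2 over all beams * vocab entries
print(top2.values)                         # should match generated_outputs.sequences_scores
print(top2.indices % vocab_size)           # token ids, cf. generated_outputs.sequences[:, -1]
print(top2.indices // vocab_size)          # beam index each candidate continued from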
The problem now is that the scores don't really give any information about the probability of token j at time i, which is what most people seem to be interested in.
So, I'll start a PR to change this behavior.
Hi @patrickvonplaten, I second @shuyanzh's request! I have been trying to get the individual logprobs for the tokens that are in the best hypothesis, but it's not very clear how one can do that (even when the length penalty is 0).
I will both provide some explanation and answer a question on this topic. To my knowledge, when using beam search to generate text, each element in the tuple generated_outputs.scores contains a matrix where each row corresponds to a beam stored at that step, while the values are the sum of the log probabilities of the previous sequence and the next token. Consequently, here is a snippet of code which should, in theory, allow getting all the beams stored at each step (I use the facebook/bart-base checkpoint, although it should not matter):
# Initial beam, which consists of only the </s> token, with score 1
beams = [[([2], 1)]]
for score_matrix in generated.scores:
    # Add a list, which will contain the k most probable beams at this step
    beams.append([])
    # k most probable tokens (over all beams * vocab entries)
    topk = score_matrix.view(-1).topk(4)
    # Get the sum of the log probs (score) of each beam
    topk_scores = topk.values.cpu().detach().numpy()
    # Get the indices of the most probable tokens
    topk_indices = topk.indices.cpu().detach().numpy()
    # Get the token ids
    topk_tokens = topk_indices % generated.scores[1].shape[1]
    # Get the indices of the beams to which these tokens belong
    topk_beams = topk_indices // generated.scores[1].shape[1]
    # For each of the k most probable tokens
    for token, nbeam, score in zip(topk_tokens, topk_beams, topk_scores):
        # Take the beam to which the token belongs
        beam = list(beams[-2][nbeam][0])
        # Add the token to the beam
        beam.append(token)
        # Add the beam and its score
        beams[-1].append((beam, score))
Outputting the last element with [(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(beam_data[0])), beam_data[1]) for beam_data in beams[-1]]
results in:
[('</s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>',
-8.348045),
('</s><s>David Harris has called for the BBC to do more to promote books and encourage.</s>',
-9.903482),
('</s><s>David Harris has called for the BBC to do more to promote books. awards to do',
-10.104368),
('</s><s>David Harris has called for the BBC to do more to promote books and authors authors.',
-10.360886)]
Since I used the default length_penalty=1, to get the true score of the sequence I do: generated.sequences_scores.item() * (len(generated_output.sequences[0]) - 1). This score coincides with the score of the most probable beam among the last beams (beam 0); however, the sequences are different. The true sequence, output with tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(generated_output.sequences[0])),
results in:
</s><s>David Harris has called for the BBC to do more to promote books and authors.</s>
A closer look showed that this text, corresponding to the highest score (i.e. the most probable one) (</s><s>David Harris has called for the BBC to give back to books.</s></s>.</s>), coincides with the output of the model when length_penalty=0, excluding the last 3 'rubbish' tokens. Here is the output when length_penalty=0:
</s><s>David Harris has called for the BBC to give back to books.</s>
I hope I am missing something here; it still seems pretty vague to me.
May be of interest to @patrickvonplaten
I eventually managed to solve this problem, @Aktsvigun.
I found it simpler to use length_penalty=0.
I created a tensor which keeps the scores for the beams that are being updated (similar to what you get with the input_ids):
running_scores = torch.cat([running_scores[beam_idx, :], beam_scores.unsqueeze(-1)], dim=-1)
Then, when the hypotheses of the finished beams are added in beam_scorer.process, I also pass in the running_scores. Finally, I concatenate the sum of log probs (next_score) to the running_scores. Then, when you get the most likely hypothesis in finalize, the tuple contains the running_scores.
To get the log probs for each token, one would just need to take the consecutive increments (negative here) in running_scores, as sketched below.
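A toy sketch of that last step, i.e. turning a 1-D tensor of cumulative beam scores into per-token log probs (the running_scores values here are made up):
import torch

running_scores = torch.tensor([0.0, -1.2, -2.5, -4.1])     # cumulative log probs per step
token_logprobs = running_scores[1:] - running_scores[:-1]  # consecutive increments
print(token_logprobs)  # tensor([-1.2000, -1.3000, -1.6000])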
Hi @nunonmg, thanks for your solution! Still, length_penalty=0 prompts the model to generate shorter sequences, which may negatively affect their quality. Namely, in my example, length_penalty=0 results in the sequence David Harris has called for the BBC to give back to books., while the summary generated with length_penalty=1 is David Harris has called for the BBC to do more to promote books and authors., which is longer and more accurate.
Yes, one can use length_penalty=0 just for confirmation purposes. As I am using the beam_scores, these are the cumulative sums (as if length_penalty=0). The length_penalty is only used when you compute the score of a finished hypothesis. Thus, if you use the setting that I mentioned, the final beam score would be the cumulative score at the last token divided by the length of the hypothesis.
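A toy illustration of that relation, assuming (as described earlier in the thread) that a finished hypothesis is scored as its cumulative log prob divided by its length raised to length_penalty:
cumulative_logprob = -10.36   # made-up sum of per-token log probs
hyp_len = 16                  # made-up hypothesis length
for length_penalty in (0.0, 1.0):
    print(length_penalty, cumulative_logprob / (hyp_len ** length_penalty))
# length_penalty=0 leaves the cumulative sum untouched; length_penalty=1 divides by the length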
Thank you! But why are the scores the same in my example, given that they "correspond" to different texts? In addition, how can I "restore" the correct sequence when using length_penalty=0?
Looks like the pull request is here: https://github.com/huggingface/transformers/pull/14654 and is implemented in transformers v4.16.0
Can you please explain the scores returned by generate in detail, in particular when we use a batch_size > 1?
Why does applying argmax() on scores not give the same thing as in sequences?
With batch_size > 1, why is the scores shape not (batch_size, beam_nums, vocab_len) instead of (batch_size*beam_nums, vocab_len)? It is really confusing.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
unk_index = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
eos_index = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))
model.to("cuda")

# sequences
seq1 = "summarize: I am confused! I am confused"
seq2 = "summarize: why generate does not work with batch_size >1"

# encoding input and attention mask
encoding = tokenizer(
    [seq1, seq2],
    padding="longest",
    max_length=128,
    truncation=True,
    return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids.to("cuda"), encoding.attention_mask.to("cuda")

output = model.generate(input_ids,
                        max_length=64,
                        early_stopping=False,  # to get len(scores) = sequences max_length
                        num_beams=4,
                        do_sample=False,
                        output_scores=True,
                        no_repeat_ngram_size=4,
                        return_dict_in_generate=True,
                        num_return_sequences=1)

tokenizer.batch_decode(output.sequences, skip_special_tokens=True)

# output.sequences
output.sequences
# tensor([[    0,    27,   183, 11319,    55,     1,     0,     0,     0,     0,
#              0,     0,     0,     0],
#         [    0,  3806,   405,    59,   161,    28, 11587,   834,  7991,  2490,
#            536,     3,     5,     1]], device='cuda:0')

# How to get the above indices using output.scores ??
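Not an answer to the argmax question, but one way to at least make the batch/beam structure explicit is to view each step as (batch_size, num_beams, vocab_size). A sketch, assuming the flat (batch_size * num_beams, vocab_size) layout described above and the variables from the snippet:
step_scores = output.scores[0]                       # shape: (batch_size * num_beams, vocab_size)
batch_size, num_beams = input_ids.shape[0], 4
per_beam = step_scores.view(batch_size, num_beams, -1)
print(per_beam.shape)                                # e.g. torch.Size([2, 4, vocab_size])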
Could you elaborate on how you chose gen_probs.prod(-1) as your method of obtaining a unique probability per sequence? Why not use gen_probs.mean(-1) for the average probability score per sequence?
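A toy comparison of what the two reductions actually mean (made-up per-token probabilities for a single generated sequence):
import torch

gen_probs = torch.tensor([0.9, 0.8, 0.5])        # hypothetical per-token probabilities
joint = gen_probs.prod(-1)                        # probability of the whole sequence: 0.36
arithmetic_mean = gen_probs.mean(-1)              # ~0.733, not a sequence probability
geometric_mean = gen_probs.log().mean(-1).exp()   # ~0.711, a length-normalized alternative
print(joint, arithmetic_mean, geometric_mean)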
Hey everyone,
We have released a new function to solve this problem; have a look at this thread: [Announcement] Generation: Get probabilities for generated output
Since some of the snippets at the start of this thread no longer match our API, I'd like to ask for new questions/comments to be posted on the thread I've linked above.
(@danielcabal, you might find an answer to your question in the post I linked.)