[Announcement] Generation: Get probabilities for generated output

Hey @fzyzcjy – if I understand your question correctly, you want the model output at each step of generation, i.e. the logits for all batches/beams at each step. You have access to that information in outputs.scores (see its docstring, e.g. for beam search).

The function .compute_transition_scores() is only needed to get the logits of the selected tokens, the tokens present in outputs.sequences.

@joaogante Hi, thanks for your reply! However, I need one probability for the whole sequence (instead of one probability per token). For example, suppose the input of a BART model is an article, I am doing summarization, and I call generate with beam search and num_beams=3. The model will then output 3 sentences, say, “I am sentence one” / “I am sentence two” / “I am sentence three”. Now I want 3 float numbers, representing the probability of each sentence. For example, the first float should represent P("I am sentence one" | the input article). I do not need things like P("I" | the input article) or P("I am" | the input article) or P("I am sentence" | the input article).

@fzyzcjy gotcha. In that case, yes, the script you shared would be a way to do it (and yes, with normalize_logits=True) – probabilities will be a tensor with shape (batch_size*num_return_sequences,) and its contents must be <= 1.0 when length_penalty==0.0. After you apply the length penalty, then you no longer have probabilities (hence the terminology score, instead of logits/probabilities).

If you are getting values > 1.0 with length_penalty==0.0, then it means we have an important bug to catch :bug: Can I please ask you to open an issue on github, share a script for reproducibility, and tag me (@gante)? :pray:
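As a minimal sketch of the arithmetic above (toy numbers, no real model involved), assuming transition_scores is the tensor returned by model.compute_transition_scores(..., normalize_logits=True):

```python
import torch

# Toy stand-in for the output of model.compute_transition_scores(...)
# called with normalize_logits=True: one row per returned sequence,
# one column per generated token, entries are log-probabilities.
transition_scores = torch.tensor([
    [-0.1, -0.3, -0.2],  # sequence 1
    [-0.5, -0.4, -0.6],  # sequence 2
])

# The probability of a whole sequence is the product of its token
# probabilities, i.e. exp of the summed log-probabilities.
probabilities = transition_scores.sum(dim=1).exp()
print(probabilities)  # every entry is <= 1.0
```

With no length penalty applied, each entry is a genuine probability and therefore at most 1.0.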


@joaogante Thanks for your suggestions!

All single values are <= 1.0; it is their sum that is > 1.0. I figured out that it seems to be because of the nonzero length penalty.

@fzyzcjy I believe I rushed my previous answer, which I edited for clarity :slight_smile: (essentially yes, it is only a probability when length_penalty = 0.0)

@joaogante I got some more time to work on this issue again. If we want to calculate token log probs for AutoModelForSeq2SeqLM (I used Flan-T5), do the pairings between the probabilities and the tokens have to be shifted as well? In this case, the shifting is done internally when the labels are shifted right for the decoder input, right? This is the biggest selling point for a token log probs API, because one has to get these pairings correct for all architectures. I don’t get high logits for obvious words in our test sentences, so I suspect the code I provided is still incorrect. Any ideas about what I am doing wrong?

Hey @vblagoje – you also need to shift the output logits by one in seq2seq models, if you use similar code. The logic is the same: the logits are always with respect to the next token :slight_smile:
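To make the shift concrete, here is a minimal sketch with random toy logits (no real model involved); `gather` picks out, for each position t, the log-probability the model assigned to the token at position t + 1:

```python
import torch
import torch.nn.functional as F

# Toy logits: batch of 1, 4 decoder positions, vocab of 5. In causal
# decoding, logits[:, t] scores the token at position t + 1.
logits = torch.randn(1, 4, 5)
token_ids = torch.tensor([[2, 0, 4, 1]])  # the decoded token ids

log_probs = F.log_softmax(logits, dim=-1)

# Drop the last position's logits and the first token, so that
# position t is paired with the token it actually predicted.
token_log_probs = log_probs[:, :-1, :].gather(
    2, token_ids[:, 1:].unsqueeze(-1)
).squeeze(-1)

print(token_log_probs.shape)  # one log-prob per predicted token
```

The same pairing rule applies whether the logits come from a decoder-only model or a seq2seq decoder, as long as you align logits with the ids that were actually fed to that decoder.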

Hello guys,

If I have implemented the following code:

tokens = model.generate(
    input_ids=token['input_ids'],
    attention_mask=token['attention_mask'],
    num_beams=2,
    num_return_sequences=2,
    return_dict_in_generate=True,
    output_scores=True,
    renormalize_logits=True,
    early_stopping=True,
    max_new_tokens=750,
)

transition_scores = model.compute_transition_scores(tokens.sequences, tokens.scores, normalize_logits=True)
for i in range(len(transition_scores)):
    prob = np.exp(transition_scores.numpy())[i].sum(axis=0)
    print(prob / len(transition_scores[i]))

The printed probability (the one in the for loop) is supposed to be the probability of the generated output, which is a stream of tokens. Thus, with num_beams=2, I would get (after running the code) something like this:

0.917925999082368 → Cumulative probability of the tokens generated for the first beam
0.8858097997205011 → Cumulative probability of the tokens generated for the second beam

Is my interpretation correct?

@mastro1996 Two important details you should fix to get a correct interpretation:

  1. Because you are using beam search, in model.compute_transition_scores you should also pass beam_indices=tokens.beam_indices. With beam search, the contents of tokens.scores are scrambled, and beam_indices are required to de-scramble them.
  2. Language modeling is a causal problem, so it makes more sense to evaluate the product of the probabilities (and not the sum). The product of the token probabilities corresponds to the probability of the sequence!
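A tiny worked example of point 2, with made-up token probabilities:

```python
import math

# Made-up per-token probabilities for one generated sequence.
token_probs = [0.9, 0.8, 0.95]

# The sequence probability is the product of the token probabilities...
sequence_prob = math.prod(token_probs)

# ...which equals exp of the summed log-probabilities, i.e. what
# transition_scores.sum(axis=1).exp() computes on real outputs.
sequence_prob_via_logs = math.exp(sum(math.log(p) for p in token_probs))

print(round(sequence_prob, 4))  # 0.684
```

Averaging the probabilities, as in the snippet above, gives the mean per-token probability rather than the probability of the whole sequence.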

Hey,
it is giving the probability for the sentence generated by the model, right?

@Ranjittechie I would need the full context (what is probabilities?) to answer your question :slight_smile:

modified from “example 1” and “example 2”

outputs = model.generate(inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True  # NOTE normalize SHOULD be true?
)
output_length = inputs.input_ids.shape[1] + np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
probabilities = torch.exp(transition_scores.sum(axis=1) / (output_length ** length_penalty))

here you can get it @joaogante

@joaogante this one I referred to

@Ranjittechie probabilities = torch.exp(transition_scores.sum(axis=1)) (without the length_penalty division) would be the closest you can get to the probability of the generated sequences.

When using beam search: with the length_penalty division, it is a score. Kind of like a probability, but without the guarantee that the probabilities of all possible generated sequences sum to 1. Be mindful that beam search picks the outputs with the highest score, not the highest probability – this allows you to control how long you want your outputs to be.

Please note that length_penalty is NOT used outside beam search.

And yes, if you are interested in probabilities, normalize_logits should be True.
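As an illustration of the difference, a sketch with made-up summed log-probabilities (the real values would come from transition_scores.sum(axis=1)):

```python
import torch

# Made-up summed log-probabilities and output lengths for two sequences.
summed_log_probs = torch.tensor([-2.0, -3.5])
output_length = torch.tensor([4.0, 7.0])
length_penalty = 1.0  # the transformers default for beam search

# Without the division: a true probability for each sequence.
probabilities = summed_log_probs.exp()

# With the division: the beam-search score. It no longer behaves like
# a probability, but it is what beam search actually ranks by;
# length_penalty > 0.0 promotes longer sequences, < 0.0 shorter ones.
scores = summed_log_probs / (output_length ** length_penalty)

print(probabilities, scores)
```

Note that in this toy example the two sequences tie on score despite having very different probabilities, which is exactly the length-normalization effect the penalty exists to provide.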


here I am not using beam search, right?
@joaogante
and it would be helpful if you could share a piece of code here that returns the probabilities of generated sequences!

also a piece of code that uses beam search to generate sequences!

because I tried with custom training and it’s not generating very correct responses!

is there any specific data format I should follow?

currently using a text file containing a question-and-answer format inside!

Hey @Ranjittechie – you can find the answers to all your questions, plus examples, in our documentation and blog posts :slight_smile:

In your example you are not using beam search, so length_penalty is not used (I’ve edited my comment above for clarity).

Thank you @joaogante

Got the answer to my question !

Dropping a quick thank you note to everyone in this discussion.
I was really not sure about the difference between scores and transition_scores but having read this thread things became much clearer. Thank you again.

Why do we need shifted logits for seq2seq? I think in a seq2seq model the output logits don’t need to be shifted, as shown in the repo: transformers/modeling_t5.py at 68287689f2f0d8b7063c400230b3766987abf18d · huggingface/transformers · GitHub

I agree that we may not need to shift in seq2seq.