[Announcement] Generation: Get probabilities for generated output

Nice! This is not exposed in the hosted inference API, right?

@jas-ho let me answer in two parts:

  1. We favor Inference Endpoints (Inference Endpoints - Hugging Face) over the Inference API, as the former are much more flexible.
  2. Inference endpoints are fully flexible, so it is surely doable. The question is how complex it is to do it :slight_smile: I have no idea, so I will build a demo and share it here!

@Seohyeong I agree that formalizing the definitions for those three things (scores, transition_scores, and sequence_scores) can be tricky. Especially because we, in transformers, have to ensure that what we build is backward compatible, which means that our naming of variables can become slightly inaccurate.

  1. scores are the UNNORMALIZED log probabilities (they will only be normalized if you pass renormalize_logits=True to .generate()). Sadly, we are stuck with unnormalized log probabilities to ensure backwards compatibility :frowning: It is a tuple containing one entry for each generated token. Each tuple member is a tensor containing the log probabilities from the model, for all words in the vocabulary. These log probabilities are gathered AFTER manipulating them with our logits processors (e.g. after setting the probability of certain words to 0, after applying top k, …);
  2. sequence_scores correspond to the sum of the log probabilities in scores for the selected tokens. If length_penalty is not 0, sequence_scores will be divided by sequence_length**length_penalty;
  3. transition_scores contains scores for the tokens that were selected at generation time. You can set normalize_logits=True to ensure they are normalized at a token level (i.e. to ensure the sum of probabilities for all vocabulary at a given generation step is 1).

:warning: An important detail: because scores is unnormalized by default, it also means that sequence_scores is the sum of unnormalized scores, which can be undesirable. A workaround is to create transition_scores from compute_transition_scores with normalize_logits=True, from which you can recompute the sequence scores using normalized scores.
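
For reference, a minimal sketch of that workaround (the checkpoint, prompt, and greedy decoding here are just illustrative assumptions):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer(["Today is"], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)

# per-token log probabilities of the selected tokens, normalized over the vocabulary
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

# recompute a normalized sequence score: the sum of the per-token log probabilities
# (no length penalty applied here)
sequence_log_probs = transition_scores.sum(dim=1)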

Hopefully we can correct this unintuitive behavior when we release the next major version (transformers v5.0) :pray:


Hey @Seohyeong here it is

def to_tokens_and_logprobs(model, tokenizer, input_texts):
    input_ids = tokenizer(input_texts, padding=True, return_tensors="pt").input_ids
    outputs = model(input_ids, labels=input_ids)
    probs = torch.log(outputs.logits.softmax(dim=-1) / 100).detach()

    # collect the probability of the generated token
    # we need to add a dummy dim in the end to make gather work
    gen_probs = torch.gather(probs, 2, input_ids[:, :, None]).squeeze(-1)

    batch = []
    for input_sentence, input_probs in zip(input_ids, gen_probs):
        text_sequence = []
        for token, p in zip(input_sentence, input_probs):
            if token not in tokenizer.all_special_ids:
                text_sequence.append((tokenizer.decode(token), p.item()))
        batch.append(text_sequence)
    return batch

Then you would call this method with:

to_tokens_and_logprobs(model, tokenizer, ["Give me something", "Good morning", "Hello, how are you?"])

And get the token and log-prob pairs:

[[('Give', -13.964306831359863),
  ('me', -5.264768123626709),
  ('something', -6.42432165145874)],
 [('Good', -20.744667053222656), ('morning', -4.624837398529053)],
 [('Hello', -8.46367073059082),
  (',', -5.478875637054443),
  ('how', -6.255255222320557),
  ('are', -4.785802841186523),
  ('you', -4.666200160980225),
  ('?', -4.6729416847229)]]

It would be great if @joaogante could review this code snippet :slight_smile:


Thanks for the explanation! I understand why certain things are implemented in such ways. I personally had a difficult time trying to find definitions of scores, sequence_scores, and transition_scores and how they are calculated. It would be great if this chunk of your explanation were added to the official documentation.

Also, I run into a RuntimeError caused by a tensor size mismatch when running compute_transition_scores() with a batch size > 1. Does this method only support inputs with batch size 1 for now?

Hey @vblagoje :wave: I believe you forgot one tiny detail, other than that looks like a solid implementation! In a nutshell, these models return the probabilities for the next token, which means that logits[batch_idx, seq_idx, vocab_idx] actually contains the logits corresponding to input_ids[batch_idx, seq_idx + 1]. This implies that the pairing between the probabilities and the tokens has to be shifted by one :slight_smile:

Here is a modified script:

from pprint import pprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def to_tokens_and_logprobs(model, tokenizer, input_texts):
    input_ids = tokenizer(input_texts, padding=True, return_tensors="pt").input_ids
    outputs = model(input_ids)
    probs = torch.log_softmax(outputs.logits, dim=-1).detach()

    # collect the probability of the generated token -- probability at index 0 corresponds to the token at index 1
    probs = probs[:, :-1, :]
    input_ids = input_ids[:, 1:]
    gen_probs = torch.gather(probs, 2, input_ids[:, :, None]).squeeze(-1)

    batch = []
    for input_sentence, input_probs in zip(input_ids, gen_probs):
        text_sequence = []
        for token, p in zip(input_sentence, input_probs):
            if token not in tokenizer.all_special_ids:
                text_sequence.append((tokenizer.decode(token), p.item()))
        batch.append(text_sequence)
    return batch


tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id

input_texts = ["One plus one is two", "Good morning", "Hello, how are you?"]

batch = to_tokens_and_logprobs(model, tokenizer, input_texts)
pprint(batch)

which yields

[[('One', -5.882715702056885),
  (' plus', -9.785109519958496),
  (' one', -0.7229145169258118),
  (' is', -2.494063377380371),
  (' two', -6.137458324432373)],
 [('Good', -7.5790300369262695), (' morning', -1.826707124710083)],
 [(',', -2.343151807785034),
  (' how', -4.339702606201172),
  (' are', -2.6824729442596436),
  (' you', -0.4109247326850891),
  ('?', -1.8950778245925903)]]

Notice how high the log probabilities for certain obvious tokens are, like morning, you, or ?! Checking these tokens is always a good sanity check :smiley: Also, look at the last sentence: there is no score for its first token. If you want a score for that token, you need to add extra padding on the left, so that the first text token is not the actual first token fed to the model.
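
For example, a sketch of that extra left padding, reusing the script above (prepending GPT-2's <|endoftext|> token is just one option, not necessarily what the author had in mind):

# prepend the BOS/EOS token so that, after the shift, the first real token also gets a score
prefixed_texts = [tokenizer.bos_token + text for text in input_texts]
batch = to_tokens_and_logprobs(model, tokenizer, prefixed_texts)
pprint(batch)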


@Seohyeong yeah, we have a very big backlog and improving the docstrings is part of it. Our team is actually quite small (I’m the only one working on generate at the moment), so PRs with improvements are very welcome :pray:

Can you open an issue on GitHub with what you are experiencing? (compute_transition_scores() with a batch size > 1)

Excellent, you are right, thank you @joaogante

Hi, thanks for the PR and explanations! I am still a bit confused :confused: I want to get the probability of each output under (for example) a beam search. In other words, P_\theta(output|input) where \theta is language model parameters, output is the output of the LM, and input is the input. Since it is nothing but a conditional probability, I guess this is clearly defined. Then, how should I get this value?

My current guess:

# modified from "example 1" and "example 2"
outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True # NOTE normalize SHOULD be true?
)
output_length = inputs.input_ids.shape[1] + np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
probabilities = torch.exp(transition_scores.sum(axis=1) / (output_length**length_penalty))
# what I want is the `probabilities` (correct?)

Looks like it is wrong, because I can get probabilities that sum up to more than 1.

Hey @fzyzcjy – if I understand your question correctly, you want the model output at each step of generation, i.e. the logits for all batches/beams at each step. You have access to that information in outputs.scores (see its docstring, e.g. for beam search).

The function .compute_transition_scores() is only needed to get the logits of the selected tokens, i.e. the tokens present in outputs.sequences.
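
For a quick sanity check of the shapes involved, a small sketch (assuming inputs is a batch of tokenized prompts and beam search with num_beams=3):

outputs = model.generate(
    **inputs, num_beams=3, max_new_tokens=4, return_dict_in_generate=True, output_scores=True
)
print(len(outputs.scores))      # one entry per generation step (here, at most 4)
print(outputs.scores[0].shape)  # (batch_size * num_beams, vocab_size): scores for every beam at that step
print(outputs.sequences.shape)  # (batch_size * num_return_sequences, sequence_length): the selected tokens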

@joaogante Hi thanks for your reply! However, I need only one probability for one whole sequence (instead of one probability for one token). For example, suppose the input of a BART model is an article, and I am doing the summarization task, and I call generate using beam search with num_beams=3. Then, the model will output 3 sentences, say, “I am sentence one” / “I am sentence two” / “I am sentence three”. Now I want to have 3 float numbers, representing the probability of each sentence. For example, the first floating number should represent P("I am sentence one" | the input article). I do not need things like P("I" | the input article) or P("I am" | the input article) or P("I am sentence" | the input article).

@fzyzcjy gotcha. In that case, yes, the script you shared would be a way to do it (and yes, with normalize_logits=True) – probabilities will be a tensor with shape (batch_size*num_return_sequences,) and its contents must be <= 1.0 when length_penalty==0.0. After you apply the length penalty, then you no longer have probabilities (hence the terminology score, instead of logits/probabilities).

If you are getting values > 1.0 with length_penalty==0.0, then it means we have an important bug to catch :bug: Can I please ask you to open an issue on github, share a script for reproducibility, and tag me (@gante)? :pray:


@joaogante Thanks for your suggestions!

All individual values are <= 1.0; it is their sum that is > 1.0. I think I have figured out that it is because of the nonzero length penalty.

@fzyzcjy I believe I rushed my previous answer, which I edited for clarity :slight_smile: (essentially yes, it is only a probability when length_penalty = 0.0)

@joaogante I got some more time to work on this issue again. If we wanted to calculate token log probs for AutoModelForSeq2SeqLM (I used Flan-T5), do the pairings between the probabilities and the tokens have to be shifted as well? In this case, the shifting is done internally when the labels are shifted right to build the decoder input, right? Getting these pairings right for all architectures is the biggest selling point for a token log probs API. I don't get high log probabilities for obvious words in our test sentences, so I suspect the code I provided is still incorrect. Any ideas about what I am doing wrong?

Hey @vblagoje – you also need to shift the output logits by one in seq2seq models, if you use similar code. The logic is the same: the logits are always with respect to the next token :slight_smile:
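
To illustrate, here is a sketch of that shift for an encoder-decoder model (assuming google/flan-t5-small and a hand-written target sentence; this is not vblagoje's actual code):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

input_ids = tokenizer(["Translate to German: How old are you?"], return_tensors="pt").input_ids
target_ids = tokenizer(["Wie alt bist du?"], return_tensors="pt").input_ids

# prepend the decoder start token (the pad token for T5), so that logits[:, i] predicts decoder_input_ids[:, i + 1]
decoder_input_ids = torch.cat(
    [torch.full((target_ids.shape[0], 1), model.config.decoder_start_token_id, dtype=torch.long), target_ids],
    dim=-1,
)
logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).logits
log_probs = torch.log_softmax(logits, dim=-1)

# same shift as in the causal example: the score at position i belongs to the token at position i + 1
token_log_probs = torch.gather(log_probs[:, :-1, :], 2, decoder_input_ids[:, 1:, None]).squeeze(-1)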

Hello guys,

If I have implemented the following code:

tokens = model.generate(
    input_ids=token['input_ids'],
    attention_mask=token['attention_mask'],
    num_beams=2,
    num_return_sequences=2,
    return_dict_in_generate=True,
    output_scores=True,
    renormalize_logits=True,
    early_stopping=True,
    max_new_tokens=750
)

transition_scores = model.compute_transition_scores(tokens.sequences, tokens.scores, normalize_logits=True)
for i in range(len(transition_scores)):
    prob = np.exp(transition_scores.numpy())[i].sum(axis=0)
    print(prob / len(transition_scores[i]))

The printed probability (the one in the for loop) is supposed to be the probability of the generated output, which is a stream of tokens. Thus, with beam_size=2, I would get (after running the code) something like this:

0.917925999082368 → Cumulative probability of the tokens generated for the first beam
0.8858097997205011 → Cumulative probability of the tokens generated for the second beam

Is my interpretation correct?

@mastro1996 Two important details you should fix to get a correct interpretation:

  1. Because you are using beam search, in model.compute_transition_scores you should also pass beam_indices=tokens.beam_indices. With beam search, the contents of tokens.scores are scrambled, and beam_indices are required to de-scramble them.
  2. Language modeling is a causal problem, so it makes more sense to evaluate the product of the probabilities (and not the sum). The product of the token probabilities corresponds to the probability of the sequence! (There is a sketch combining both points right below.)
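
Putting both points together, a sketch that continues your snippet (it reuses the model, token, and np names from your code above, and assumes tokens is the output of your generate call):

transition_scores = model.compute_transition_scores(
    tokens.sequences, tokens.scores, beam_indices=tokens.beam_indices, normalize_logits=True
)
# product of the token probabilities = exp of the sum of the token log probabilities;
# steps after a beam finished early are masked to 0.0, so they do not change the sum
sequence_probs = np.exp(transition_scores.numpy().sum(axis=1))
for i, p in enumerate(sequence_probs):
    print(f"sequence {i}: P(sequence | input) = {p}")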

Hey,
it is giving the probability for the sentence generated by the model, right?

@Ranjittechie I would need the full context (what is probabilities?) to answer your question :slight_smile: