[Announcement] Generation: Get probabilities for generated output

Hey everyone :wave:

We have just merged a PR that exposes a new function related to .generate(), compute_transition_scores. With this function, you can quickly solve any problem that requires the probabilities of generated tokens, for any generation strategy. It is also nicely documented – see here.

How can this function help you? Let me give you two simple examples!

Example 1 -- print the probabilities for the output generated by Greedy Search
from transformers import GPT2Tokenizer, AutoModelForCausalLM
import numpy as np

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs = tokenizer(["Today is"], return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

input_length = inputs.input_ids.shape[1]
generated_tokens = outputs.sequences[:, input_length:]
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | logits | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.4f} | {np.exp(score.numpy()):.2%}")
# Expected output:
#|   262 |  the     | -1.4136 | 24.33%
#|  1110 |  day     | -2.6089 | 7.36%
#|   618 |  when    | -2.0096 | 13.40%
#|   356 |  we      | -1.8593 | 15.58%
#|   460 |  can     | -2.5083 | 8.14%
Example 2 -- recompute the sequence scores from Beam Search
from transformers import GPT2Tokenizer, AutoModelForCausalLM
import numpy as np

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs = tokenizer(["Today is"], return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=5,
    num_beams=4,
    num_return_sequences=4,
    return_dict_in_generate=True,
    output_scores=True,
)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
)
# If you sum the generated tokens' scores and apply the length penalty, you'll get the sequence scores.
# Tip: set `normalize_logits=True` to recompute the scores from the normalized logits.
output_length = np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
reconstructed_scores = transition_scores.sum(axis=1) / (output_length**length_penalty)
print(np.allclose(outputs.sequences_scores, reconstructed_scores))
# Expected output:
#True

:fire: There is also an interactive demo here that makes use of these functionalities to color-code generated text according to the probabilities :fire:

Let me know if you have comments, questions, and/or suggestions! :hugs:

(P.S.: This new post is also meant as a replacement for an older one, which contains stale examples. Let’s keep all further discussion around generated token probabilities here!)


@joaogante coming from the log probabilities thread here, as per your recommendation. This is a great contribution! While at it: do you plan to add utility functions to calculate the log probability of each token when no scores are given, only token ids? More generally: given a sentence, calculate the log probability of each token in that sentence.

Hi @vblagoje :wave:

That was not in my plans, but you're not the first person asking for it… so I'll include it there! In summary, I think .generate() can be modified to also return the scores of the input tokens, for decoder-only models like the GPT family or OPT.

Hey @joaogante,

Thank you for the prompt response. However, I meant more of a general utility method to calculate log probs for a given text, regardless of generate. Of course, one can do this "by hand" by applying softmax to the logits, but then we have to check the docs: do we need to shift the inputs, do we need to invoke gather to get the logits of individual tokens, do the calculations differ from one architecture to another, etc. Why not have a utility method that, for a given model and text, returns the tokens and their logprobs?


Getting an AttributeError: 'GPT2LMHeadModel' object has no attribute 'compute_transition_scores' error while trying to run. Installed transformers library using pip install git+https://github.com/huggingface/transformers


Hey @rsaha :wave: perhaps you need to add the --upgrade flag (i.e. pip install --upgrade git+https://github.com/huggingface/transformers.git)

If the problem persists, please open an issue in transformers :pray:

@vblagoje I see. That is easy to build; I've added it to my to-do list :slight_smile:


Awesome, @joaogante. I have my own implementation that works well for the model we currently use, but it would undoubtedly be better to use an official API one day. I think others will find it helpful as well, especially now that providers like OpenAI and Cohere offer token logprobs as part of their standard APIs. All the best


Worked with the upgrade flag!

@joaogante Thanks for your contribution. I've been looking for this, and it works well with T5 on the latest tag (v4.26.0). Can you double-check whether my understanding is correct?

transition_scores_i = log P(y_i | y_1, …, y_{i-1}, x)
sequence_scores (from .generate()) = (1/t) * sum_{i=1}^{t} log P(y_i | y_1, …, y_{i-1}, x)

Therefore, as stated in the documentation, the length-normalized sum of transition_scores equals sequence_scores? (But the documentation states that this is only guaranteed when normalize_logits=False. Does this mean the original implementation of sequence_scores does not normalize the logits across the vocab dimension?)

Also, my understanding of scores from .generate() is still shaky. Could anyone share their understanding of how the scores from the .generate() method differ from the transition_scores from .compute_transition_scores()?


Hi @vblagoje. I’ve been looking for such implementation. Would you be willing to share yours?

Nice! This is not exposed in the hosted inference API, right?

@jas-ho let me answer in two parts:

  1. We favor Inference Endpoints (Inference Endpoints - Hugging Face) over the Inference API, as the former are much more flexible.
  2. Inference endpoints are fully flexible, so it is surely doable. The question is how complex it is to do it :slight_smile: I have no idea, so I will build a demo and share it here!

@Seohyeong I agree that formalizing the definitions for those three things (scores, transition_scores, and sequence_scores) can be tricky. Especially because we, in transformers, have to ensure that what we build is backward compatible, which means that our naming of variables can become slightly inaccurate.

  1. scores are the UNNORMALIZED log probabilities (they will only be normalized if you pass renormalize_logits=True to .generate()). Sadly, we are stuck with unnormalized log probabilities to ensure backwards compatibility :frowning: It is a tuple containing one entry for each generated token. Each tuple member is a tensor containing the log probabilities from the model, for all words in the vocabulary. These log probabilities are gathered AFTER manipulating them with our logits processors (e.g. after setting the probability of certain words to 0, after applying top k, …);
  2. sequence_scores correspond to the sum of the log probabilities in scores. If length_penalty is not 0, sequence_scores will be divided by sequence_length**length_penalty;
  3. transition_scores contains scores for the tokens that were selected at generation time. You can set normalize_logits=True to ensure they are normalized at a token level (i.e. to ensure the sum of probabilities for all vocabulary at a given generation step is 1).

:warning: An important detail: because scores is unnormalized by default, it also means that sequence_scores is the sum of unnormalized scores, which can be undesirable. A workaround is to create transition_scores from compute_transition_scores with normalize_logits=True, from which you can recompute the sequence scores using normalized scores.
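
For illustration, a minimal sketch of that workaround, reusing the beam search outputs from Example 2 above (the variable names are just placeholders):

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=True
)
# same length bookkeeping as in Example 2 above
output_length = np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
# length-penalized sum of NORMALIZED per-token log probabilities
normalized_sequence_scores = transition_scores.sum(axis=1) / (output_length**length_penalty)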

Hopefully we can correct this unintuitive behavior when we release the next major version (transformers v5.0) :pray:


Hey @Seohyeong here it is

import torch

def to_tokens_and_logprobs(model, tokenizer, input_texts):
  encoded_input_texts = tokenizer(input_texts, padding=True, return_tensors="pt")
  input_ids = encoded_input_texts.input_ids
  outputs = model(input_ids, labels=input_ids)
  probs = torch.log(outputs.logits.softmax(dim=-1)/100).detach()

  # collect the probability of the generated token
  # we need to add a dummy dim in the end to make gather work
  gen_probs = torch.gather(probs, 2, input_ids[:, :, None]).squeeze(-1)

  batch = []
  for input_sentence, input_probs in zip(input_ids, gen_probs):
    text_sequence = []
    for token, p in zip(input_sentence, input_probs):
      if token not in tokenizer.all_special_ids:
        text_sequence.append((tokenizer.decode(token), p.item()))
    batch.append(text_sequence)
  return batch

Then you would call this method with:

to_tokens_and_logprobs(model, tokenizer, ["Give me something", "Good morning", "Hello, how are you?"])

And get the tokens and probs pairs:

[[('Give', -13.964306831359863),
  ('me', -5.264768123626709),
  ('something', -6.42432165145874)],
 [('Good', -20.744667053222656), ('morning', -4.624837398529053)],
 [('Hello', -8.46367073059082),
  (',', -5.478875637054443),
  ('how', -6.255255222320557),
  ('are', -4.785802841186523),
  ('you', -4.666200160980225),
  ('?', -4.6729416847229)]]

It would be great if @joaogante could review this code snippet :slight_smile:


Thanks for the explanation! I understand why certain things are implemented the way they are. I personally had a difficult time trying to find definitions of scores, sequence_scores, and transition_scores and how they're calculated. It'd be great if this chunk of your explanation were added to the official documentation.

Also, I run into a RuntimeError caused by a tensor size mismatch when running compute_transition_scores() with a batch size > 1. Does this method only support inputs with batch size 1 for now?

Hey @vblagoje :wave: I believe you forgot one tiny detail; other than that, it looks like a solid implementation! In a nutshell, these models return the probabilities for the next token, which means that logits[batch_idx, seq_idx, vocab_idx] actually contains the logits corresponding to input_ids[batch_idx, seq_idx + 1]. This implies that the pairing between the probabilities and the tokens has to be shifted by one :slight_smile:

Here is a modified script:

from pprint import pprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def to_tokens_and_logprobs(model, tokenizer, input_texts):
    input_ids = tokenizer(input_texts, padding=True, return_tensors="pt").input_ids
    outputs = model(input_ids)
    probs = torch.log_softmax(outputs.logits, dim=-1).detach()

    # collect the probability of the generated token -- probability at index 0 corresponds to the token at index 1
    probs = probs[:, :-1, :]
    input_ids = input_ids[:, 1:]
    gen_probs = torch.gather(probs, 2, input_ids[:, :, None]).squeeze(-1)

    batch = []
    for input_sentence, input_probs in zip(input_ids, gen_probs):
        text_sequence = []
        for token, p in zip(input_sentence, input_probs):
            if token not in tokenizer.all_special_ids:
                text_sequence.append((tokenizer.decode(token), p.item()))
        batch.append(text_sequence)
    return batch


tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id

input_texts = ["One plus one is two", "Good morning", "Hello, how are you?"]

batch = to_tokens_and_logprobs(model, tokenizer, input_texts)
pprint(batch)

which yields

[[('One', -5.882715702056885),
  (' plus', -9.785109519958496),
  (' one', -0.7229145169258118),
  (' is', -2.494063377380371),
  (' two', -6.137458324432373)],
 [('Good', -7.5790300369262695), (' morning', -1.826707124710083)],
 [(',', -2.343151807785034),
  (' how', -4.339702606201172),
  (' are', -2.6824729442596436),
  (' you', -0.4109247326850891),
  ('?', -1.8950778245925903)]]

Notice how high the log probabilities are for certain obvious tokens, like morning, you, or ?! Checking these tokens is always a good sanity check :smiley: Also, look at the last sentence: there is no log probability for the first token (Hello). If you want a score for that token, you need to add extra padding on the left, so that the first text token is not the very first token fed to the model.
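
For example, with the script above, one way to do that is to prepend the BOS token by hand (a minimal sketch, assuming GPT-2's "<|endoftext|>" as tokenizer.bos_token):

# sketch: prepend the BOS token so that the first real token is also scored
# (its log probability is then conditioned on "<|endoftext|>")
bos_texts = [tokenizer.bos_token + text for text in input_texts]
pprint(to_tokens_and_logprobs(model, tokenizer, bos_texts))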


@Seohyeong yeah, we have a very big backlog and improving the docstrings is part of it. Our team is actually quite small (I’m the only one working on generate at the moment), so PRs with improvements are very welcome :pray:

Can you open an issue on GitHub with what you are experiencing (compute_transition_scores() with a batch size > 1)?


Excellent, you are right, thank you @joaogante

Hi, thanks for the PR and explanations! I am still a bit confused :confused: I want to get the probability of each output under (for example) beam search. In other words, P_\theta(output|input), where \theta are the language model parameters, output is the output of the LM, and input is the input. Since it is nothing but a conditional probability, I guess this is clearly defined. How should I get this value, then?

My current guess:

# modified from "example 1" and "example 2"
outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True # NOTE normalize SHOULD be true?
)
output_length = inputs.input_ids.shape[1] + np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
probabilities = torch.exp(transition_scores.sum(axis=1) / (output_length**length_penalty))
# what I want is the `probabilities` (correct?)

It looks wrong, though, because I can get probabilities that sum up to more than 1.
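
One possible reading of the discrepancy: dividing the summed log probabilities by output_length**length_penalty before exponentiating yields a per-token geometric mean rather than the joint probability, so the values for the returned sequences are not constrained to sum to 1. A minimal sketch of the conditional probability itself, assuming normalize_logits=True and that no logits processors reshaped the distribution:

# minimal sketch: with normalize_logits=True, each transition score is
# log P(y_i | y_1, ..., y_{i-1}, x), so their plain sum (no length penalty)
# is the log of the conditional probability of the whole generated sequence
sequence_log_probs = transition_scores.sum(axis=1)
sequence_probs = torch.exp(sequence_log_probs)  # one value per returned sequence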