[Announcement] Generation: Get probabilities for generated output

Hello, I was wondering to what extent the probabilities for tool use can be extracted using the approach described in this thread.

I am using a simple tool call with the Cohere Command R+ model. The two tools passed to the model are: 1) a web search and 2) a call to another LLM. Whenever I prompt it with a knowledge question such as “What is the biggest penguin”, the output is a tool selection plus the query parameters for an internet search. In this case, the probability for each token is 100% except for two tokens of the generated search query. My assumption would be that whenever I prompt the LLM with a message that cannot be solved with the available tools (plus the default “directly answer”, which uses the LLM itself to answer the query), the confidence would be significantly lower.

However, if I add two tools for multiplying and summing integers and prompt with the query “what is the temperature for today”, I still get high confidence for the “directly answer” action. Is my assumption wrong, i.e. can the probabilities as defined above not be used to estimate the model’s confidence in a tool call?

Token output with probabilities
| token | token string | log probability | probability |
| --- | --- | --- | --- |
| 9814 | Action | 0.0000 | 100.00% |
| 33 | : | 0.0000 | 100.00% |
| 15080 | ``` | 0.0000 | 100.00% |
| 6329 | json | 0.0000 | 100.00% |
| 206 |  | 0.0000 | 100.00% |
| 66 | [ | 0.0000 | 100.00% |
| 1856 |  | 0.0000 | 100.00% |
| 1936 | { | 0.0000 | 100.00% |
| 1890 |  | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 22018 | tool | 0.0000 | 100.00% |
| 70 | _ | 0.0000 | 100.00% |
| 2769 | name | 0.0000 | 100.00% |
| 2209 | ": | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 6903 | web | -0.0003 | 99.97% |
| 70 | _ | 0.0000 | 100.00% |
| 9363 | search | 0.0000 | 100.00% |
| 2040 | ", | 0.0000 | 100.00% |
| 1890 |  | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 21508 | parameters | 0.0000 | 100.00% |
| 2209 | ": | 0.0000 | 100.00% |
| 1936 | { | 0.0000 | 100.00% |
| 2087 |  | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 8417 | query | 0.0000 | 100.00% |
| 2209 | ": | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 214226 | biggest | -0.8203 | 44.03% |
| 211829 | penguin | -0.0036 | 99.64% |
| 7754 | species | 0.0000 | 100.00% |
| 9 | " | 0.0000 | 100.00% |
| 1890 |  | 0.0000 | 100.00% |
| 2046 | } | 0.0000 | 100.00% |
| 1856 |  | 0.0000 | 100.00% |
| 2046 | } | 0.0000 | 100.00% |
| 206 |  | 0.0000 | 100.00% |
| 68 | ] | 0.0000 | 100.00% |
| 206 |  | 0.0000 | 100.00% |
| 3802 | ``` | 0.0000 | 100.00% |
| 255001 |  | 0.0000 | 100.00% |
Code snippet

The code snippet that I use can be found below:

from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np

model_id = "CohereForAI/c4ai-command-r-plus-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id, token="HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map="auto",
                                             token="HF_TOKEN")


# Format the message with the Command R tool-use template
conversation = [
    {"role": "user", "content": "What is the biggest penguin?"}
]
# Define the tools available to the model:
tools = [
    {
        "name": "internet_search",
        "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
        "parameter_definitions": {
            "query": {
                "description": "Query to search the internet with",
                "type": "str",
                "required": True
            }
        }
    },
    {
        "name": "directly_answer",
        "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
        "parameter_definitions": {}
    }
]

formatted_input = tokenizer.apply_tool_use_template(
    conversation, tools=tools, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)  # keep the prompt on the same device as the model

outputs = model.generate(
    formatted_input,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    return_dict_in_generate=True,
    output_scores=True
)

# Per-token log probabilities of the generated tokens
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True)

input_length = formatted_input.shape[1]
generated_tokens = outputs.sequences[:, input_length:]
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | log probability | probability
    log_prob = score.cpu().numpy()  # the scores live on the model's device
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {log_prob:.4f} | {np.exp(log_prob):.2%}")

Edit: formatted table
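
One idea I had for turning these per-token values into a single confidence for the tool selection is to sum the log probabilities of just the tokens that spell out the chosen tool name (web, _ and search in the table) and exponentiate, i.e. treat it as the joint probability of that span. A minimal sketch of that idea (span_confidence is just a helper I made up, not anything from transformers):

import numpy as np

def span_confidence(token_log_probs, start, end):
    """Joint probability of the generated tokens at positions [start, end).

    token_log_probs: 1-D array of per-token log probabilities, e.g.
    transition_scores[0] from compute_transition_scores(..., normalize_logits=True).
    """
    return float(np.exp(np.sum(token_log_probs[start:end])))

# For the table above, the tool name spans the three tokens "web", "_", "search":
tool_name_log_probs = np.array([-0.0003, 0.0000, 0.0000])
print(span_confidence(tool_name_log_probs, 0, 3))  # ~0.9997

For the temperature example with only the math tools, the same calculation over the “directly answer” tokens still comes out high, which is what prompted the question above.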


@joaogante thanks for this, it looks great. I was hoping to use your solution when deploying via an Inference Endpoint, but I don’t see that covered here.

For example, I want to be able to get the scores of the “unsafe” results from this model: meta-llama/Llama-Guard-3-8B · Hugging Face

I tried creating my own custom handler.py to do it, but wasn’t able to get it to work quite right. Any insight into how we could do this for an Inference Endpoint would be amazing.
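
For reference, this is roughly the shape of handler.py I was attempting. It is only a sketch, assuming the standard custom handler interface for Inference Endpoints (an EndpointHandler class with __init__(self, path) and __call__(self, data)); the payload shape, max_new_tokens and the dtype/device settings are my own assumptions:

# handler.py
from typing import Any, Dict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` points at the model repository the endpoint was created from
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path, torch_dtype=torch.bfloat16, device_map="auto"
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Expected payload: {"inputs": [{"role": "user", "content": "..."}]}
        messages = data["inputs"]
        input_ids = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)

        outputs = self.model.generate(
            input_ids,
            max_new_tokens=32,
            do_sample=False,
            return_dict_in_generate=True,
            output_scores=True,
        )
        # Per-token log probabilities of the generated tokens
        transition_scores = self.model.compute_transition_scores(
            outputs.sequences, outputs.scores, normalize_logits=True
        )

        generated = outputs.sequences[0, input_ids.shape[1]:]
        tokens = [
            {"token": self.tokenizer.decode(tok), "probability": float(torch.exp(score))}
            for tok, score in zip(generated, transition_scores[0])
        ]
        return {
            "generated_text": self.tokenizer.decode(generated, skip_special_tokens=True),
            "tokens": tokens,
        }

The intent is that the caller can then read the probability of the “unsafe” token out of the returned tokens list.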


Hey, I want to understand how the model generates its output. As I understand it, beam search selects the sequence with the highest cumulative probability, but when I add up the individual token scores myself, the sequence with the best cumulative score is not the response the model actually generates.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", output_attentions=True)

# Input
input_text = "How is the weather in Chennai?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=False,  # disable sampling for deterministic behavior
    max_new_tokens=3,
    num_beams=4,
    num_return_sequences=4,
    return_dict_in_generate=True,
    output_scores=True,
    early_stopping=True,
)

# Show the top-k tokens and their log probs at each step.
# With beam search, each row of `scores` corresponds to one beam (the prompt is
# expanded num_beams times), so print the top-k candidates per beam.
top_k = 3
for step, scores in enumerate(outputs.scores):
    probs = torch.log_softmax(scores, dim=-1)
    top_probs, top_tokens = torch.topk(probs, top_k)

    print(f"\nStep {step + 1}:")
    for beam in range(scores.shape[0]):
        print(f"  Beam {beam}:")
        for token, log_prob in zip(top_tokens[beam], top_probs[beam]):
            decoded_token = tokenizer.decode(token)
            print(f"    Token: {decoded_token!r}, Log Probability: {log_prob.item():.4f}")

# Show the final sequences with their log probabilities.
# With beam search, beam_indices is needed so that each step's score is matched
# to the beam the returned sequence actually came from.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=True
)
for seq, seq_scores in zip(outputs.sequences, transition_scores):
    decoded = tokenizer.decode(seq, skip_special_tokens=True)
    total_log_prob = seq_scores.sum().item()
    print(f"\nGenerated sequence: {decoded}\nLog Probability: {total_log_prob}")

response = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(response)


@joaogante
How do I calculate the sequence with the highest cumulative probability?
Discrepancy Between Top-k Token Probabilities and Model-Generated Output in Beam Search

I’ve been analyzing the beam search decoding process and noticed an inconsistency. When I manually construct a sequence using the highest cumulative log probabilities from the top-k tokens at each step, it does not match the model’s final generated output. Additionally, some words in the generated output are not even present in the top-k tokens. Also, the cumulative log probability of the model’s output is lower than the manually computed one. Could other hidden factors be influencing this? Any insights would be appreciated.
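
For reference, my current understanding from the compute_transition_scores docstring is that with beam search you also need to pass outputs.beam_indices (so each step’s score is taken from the beam the returned sequence actually came from), and that the final ranking additionally applies the length penalty, so a manual sum over the top-k step scores will generally not reproduce the ranking generate() uses. This is the reconstruction I am comparing against, adapted from the example in that docstring and using the same outputs object as in my code above:

import numpy as np

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
)
# Count only the tokens that were actually generated: their log probabilities are
# negative, while steps padded after a beam finishes are stored as 0.
output_length = np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
reconstructed_scores = transition_scores.sum(axis=1) / (output_length ** length_penalty)

# These should match the scores generate() used to rank the beams
print(np.allclose(outputs.sequences_scores, reconstructed_scores))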
