[Announcement] Generation: Get probabilities for generated output

Hello, I was wondering to what extent the probabilities for tool use can be extracted using the approach described in this thread.

I am using a simple tool call with the Cohere Command R+ model. The two tools passed to the model are: 1) a web search and 2) a call to another LLM. Whenever I prompt it with a knowledge question such as “What is the biggest penguin”, the output is a tool selection plus the query parameters for an internet search. In this case, the probability for each token is 100% except for two tokens of the generated search query. My assumption would be that whenever I prompt the LLM with a message that cannot be solved with the available tools (plus the default “directly answer”, which uses the LLM itself to answer the query), the confidence would be significantly lower.

However, if I add two tools for multiplying and summing integers and prompt with the query “what is the temperature for today”, I still get high confidence for the “directly answer” action. Is my assumption wrong, i.e. can the probabilities as defined above not be used to estimate the model’s confidence in a tool call?

Token output with probabilities
| token | token string | log probability | probability |
| --- | --- | --- | --- |
| 9814 | Action | 0.0000 | 100.00% |
| 33 | : | 0.0000 | 100.00% |
| 15080 | ``` | 0.0000 | 100.00% |
| 6329 | json | 0.0000 | 100.00% |
| 206 |  | 0.0000 | 100.00% |
| 66 | [ | 0.0000 | 100.00% |
| 1856 |  | 0.0000 | 100.00% |
| 1936 | { | 0.0000 | 100.00% |
| 1890 |  | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 22018 | tool | 0.0000 | 100.00% |
| 70 | _ | 0.0000 | 100.00% |
| 2769 | name | 0.0000 | 100.00% |
| 2209 | ": | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 6903 | web | -0.0003 | 99.97% |
| 70 | _ | 0.0000 | 100.00% |
| 9363 | search | 0.0000 | 100.00% |
| 2040 | ", | 0.0000 | 100.00% |
| 1890 |  | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 21508 | parameters | 0.0000 | 100.00% |
| 2209 | ": | 0.0000 | 100.00% |
| 1936 | { | 0.0000 | 100.00% |
| 2087 |  | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 8417 | query | 0.0000 | 100.00% |
| 2209 | ": | 0.0000 | 100.00% |
| 1789 | " | 0.0000 | 100.00% |
| 214226 | biggest | -0.8203 | 44.03% |
| 211829 | penguin | -0.0036 | 99.64% |
| 7754 | species | 0.0000 | 100.00% |
| 9 | " | 0.0000 | 100.00% |
| 1890 |  | 0.0000 | 100.00% |
| 2046 | } | 0.0000 | 100.00% |
| 1856 |  | 0.0000 | 100.00% |
| 2046 | } | 0.0000 | 100.00% |
| 206 |  | 0.0000 | 100.00% |
| 68 | ] | 0.0000 | 100.00% |
| 206 |  | 0.0000 | 100.00% |
| 3802 | ``` | 0.0000 | 100.00% |
| 255001 |  | 0.0000 | 100.00% |
Code snippet

The code snippet that I use can be found below:

from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np

model_id = "CohereForAI/c4ai-command-r-plus-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id, token="HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map="auto",
                                             token="HF_TOKEN")


# Format the message with the Command R tool-use template
conversation = [
    {"role": "user", "content": "What is the biggest penguin?"}
]
# Define the tools available to the model:
tools = [
    {
        "name": "internet_search",
        "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
        "parameter_definitions": {
            "query": {
                "description": "Query to search the internet with",
                "type": "str",
                "required": True
            }
        }
    },
    {
        "name": "directly_answer",
        "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
        "parameter_definitions": {}
    }
]

formatted_input = tokenizer.apply_tool_use_template(
    conversation, tools=tools, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)  # keep the prompt on the same device as the model

outputs = model.generate(
    formatted_input,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    return_dict_in_generate=True,
    output_scores=True
)

# Per-token log probabilities of the generated tokens
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True)

input_length = formatted_input.shape[1]
generated_tokens = outputs.sequences[:, input_length:]
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | log probability | probability
    log_prob = score.cpu().numpy()  # the scores live on the model's device
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {log_prob:.4f} | {np.exp(log_prob):.2%}")

Edit: formatted table
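
One idea I had for turning these per-token values into a single confidence for the tool selection is to sum the log probabilities of just the tokens that spell out the chosen tool name (web, _ and search in the table) and exponentiate, i.e. treat it as the joint probability of that span. A minimal sketch of that idea (span_confidence is just a helper I made up, not anything from transformers):

import numpy as np

def span_confidence(token_log_probs, start, end):
    """Joint probability of the generated tokens at positions [start, end).

    token_log_probs: 1-D array of per-token log probabilities, e.g.
    transition_scores[0] from compute_transition_scores(..., normalize_logits=True).
    """
    return float(np.exp(np.sum(token_log_probs[start:end])))

# For the table above, the tool name spans the three tokens "web", "_", "search":
tool_name_log_probs = np.array([-0.0003, 0.0000, 0.0000])
print(span_confidence(tool_name_log_probs, 0, 3))  # ~0.9997

For the temperature example with only the math tools, the same calculation over the “directly answer” tokens still comes out high, which is what prompted the question above.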


@joaogante thanks for this, it looks great. I was hoping to use your solution when deploying via an Inference Endpoint, but I don’t see that covered here.

For example, I want to be able to get the scores of the “unsafe” results from this model: meta-llama/Llama-Guard-3-8B · Hugging Face

I tried creating my own custom handler.py to do it, but wasn’t able to get it to work quite right. Any insight into how we could do this for an Inference Endpoint would be amazing.
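
For reference, this is roughly the shape of handler.py I was attempting. It is only a sketch, assuming the standard custom handler interface for Inference Endpoints (an EndpointHandler class with __init__(self, path) and __call__(self, data)); the payload shape, max_new_tokens and the dtype/device settings are my own assumptions:

# handler.py
from typing import Any, Dict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` points at the model repository the endpoint was created from
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path, torch_dtype=torch.bfloat16, device_map="auto"
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Expected payload: {"inputs": [{"role": "user", "content": "..."}]}
        messages = data["inputs"]
        input_ids = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)

        outputs = self.model.generate(
            input_ids,
            max_new_tokens=32,
            do_sample=False,
            return_dict_in_generate=True,
            output_scores=True,
        )
        # Per-token log probabilities of the generated tokens
        transition_scores = self.model.compute_transition_scores(
            outputs.sequences, outputs.scores, normalize_logits=True
        )

        generated = outputs.sequences[0, input_ids.shape[1]:]
        tokens = [
            {"token": self.tokenizer.decode(tok), "probability": float(torch.exp(score))}
            for tok, score in zip(generated, transition_scores[0])
        ]
        return {
            "generated_text": self.tokenizer.decode(generated, skip_special_tokens=True),
            "tokens": tokens,
        }

The intent is that the caller can then read the probability of the “unsafe” token out of the returned tokens list.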


Hey, I want to understand how the model generates its output. As I understand it, beam search selects the sequence with the highest cumulative probability, but when I add up the individual token scores myself, the sequence with the best cumulative score is not the response the model actually generates.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", output_attentions=True)

# Input
input_text = "How is the weather in Chennai?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=False,  # disable sampling for deterministic behavior
    max_new_tokens=3,
    num_beams=4,
    num_return_sequences=4,
    return_dict_in_generate=True,
    output_scores=True,
    early_stopping=True,
)

# Show the top-k tokens and their log probs at each step.
# With beam search, each row of `scores` corresponds to one beam (the prompt is
# expanded num_beams times), so print the top-k candidates per beam.
top_k = 3
for step, scores in enumerate(outputs.scores):
    probs = torch.log_softmax(scores, dim=-1)
    top_probs, top_tokens = torch.topk(probs, top_k)

    print(f"\nStep {step + 1}:")
    for beam in range(scores.shape[0]):
        print(f"  Beam {beam}:")
        for token, log_prob in zip(top_tokens[beam], top_probs[beam]):
            decoded_token = tokenizer.decode(token)
            print(f"    Token: {decoded_token!r}, Log Probability: {log_prob.item():.4f}")

# Show the final sequences with their log probabilities.
# With beam search, beam_indices is needed so that each step's score is matched
# to the beam the returned sequence actually came from.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=True
)
for seq, seq_scores in zip(outputs.sequences, transition_scores):
    decoded = tokenizer.decode(seq, skip_special_tokens=True)
    total_log_prob = seq_scores.sum().item()
    print(f"\nGenerated sequence: {decoded}\nLog Probability: {total_log_prob}")

response = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(response)


@joaogante
How do I calculate the sequence with the highest cumulative probability?
Discrepancy Between Top-k Token Probabilities and Model-Generated Output in Beam Search

I’ve been analyzing the beam search decoding process and noticed an inconsistency. When I manually construct a sequence using the highest cumulative log probabilities from the top-k tokens at each step, it does not match the model’s final generated output. Additionally, some words in the generated output are not even present in the top-k tokens. Also, the cumulative log probability of the model’s output is lower than the manually computed one. Could other hidden factors be influencing this? Any insights would be appreciated.
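
For reference, my current understanding from the compute_transition_scores docstring is that with beam search you also need to pass outputs.beam_indices (so each step’s score is taken from the beam the returned sequence actually came from), and that the final ranking additionally applies the length penalty, so a manual sum over the top-k step scores will generally not reproduce the ranking generate() uses. This is the reconstruction I am comparing against, adapted from the example in that docstring and using the same outputs object as in my code above:

import numpy as np

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
)
# Count only the tokens that were actually generated: their log probabilities are
# negative, while steps padded after a beam finishes are stored as 0.
output_length = np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
reconstructed_scores = transition_scores.sum(axis=1) / (output_length ** length_penalty)

# These should match the scores generate() used to rank the beams
print(np.allclose(outputs.sequences_scores, reconstructed_scores))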
