Inference Client chat completion parameter logit_bias not working

According to the docs, the logit_bias parameter for the chat_completion function expects a “JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100”. The type annotation, however, says that it should be an Optional[List[float]].
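For reference, one way to check the annotation locally is to inspect the client's signature (the exact string may vary between huggingface_hub versions):

import inspect
from huggingface_hub import InferenceClient

# Prints something like: logit_bias: Optional[List[float]] = None
print(inspect.signature(InferenceClient.chat_completion).parameters["logit_bias"])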

Indeed, if I try to pass in a dictionary, e.g.

completion = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    max_tokens=100,
    logit_bias={100: 4}
)

I get an HTTPError: 422 Client Error: Unprocessable Entity for url. I can pass in a list of floats, but I have no idea how a flat list is supposed to encode logit biases without a mapping.

Reproduction

from huggingface_hub import InferenceClient

client = InferenceClient(api_key="hf_xxx")

messages = [
    {
        "role": "user",
        "content": "The capital of France is"
    }
]

completion = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    max_tokens=20,
    logit_bias={100: 4}
)

print(completion.choices[0].message)

hi @taylorj94
There is a clearer error when you run the code:

Failed to deserialize the JSON body into the target type: logit_bias: invalid type: map, expected a sequence at line 1 column 137

But I have no idea how to express the biases without a mapping. 🙁


I’m not sure what these tests indicate, but I believe you could provide a list that is as long as the vocabulary size.

from huggingface_hub import InferenceClient
client = InferenceClient(api_key="hf_xxx")
messages = [
    {
        "role": "user",
        "content": "The capital of France is"
    }
]
completion = client.chat_completion(
    model="microsoft/Phi-3-mini-4k-instruct",
    messages=messages,
    max_tokens=20,
    logit_bias=30000 * [-100]
)
completion

Generated:
ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content='Paris', tool_calls=None), logprobs=None)], created=1735226825, id='', model='microsoft/Phi-3-mini-4k-instruct', usage=ChatCompletionOutputUsage(completion_tokens=2, prompt_tokens=8, total_tokens=10))


from huggingface_hub import InferenceClient
client = InferenceClient(api_key="hf_xxx")
messages = [
    {
        "role": "user",
        "content": "The capital of France is"
    }
]
completion = client.chat_completion(
    model="microsoft/Phi-3-mini-4k-instruct",
    messages=messages,
    max_tokens=20,
    logit_bias=30000 * [100]
)
completion

Generated:

ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content='The capital of France is Paris.', tool_calls=None), logprobs=None)], created=1735227061, id='', model='microsoft/Phi-3-mini-4k-instruct', usage=ChatCompletionOutputUsage(completion_tokens=8, prompt_tokens=8, total_tokens=16))

content='Paris' vs content='The capital of France is Paris.'

I guess you can create a list with len(vocab) * [100] and then assign a value of -100 to the specific token IDs you want to suppress; a rough sketch of that idea is below.
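A minimal sketch of that approach, assuming the list is indexed by token ID (that indexing is my guess from these tests, not documented behavior) and using 0 rather than 100 as the neutral baseline:

from huggingface_hub import InferenceClient
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dense bias list: index i is assumed to hold the bias for token ID i.
# 0 leaves a token unbiased; -100 should effectively suppress it.
bias_map = {100: -100.0}  # the {token_id: bias} mapping we actually want
logit_bias = [0.0] * len(tokenizer)
for token_id, bias in bias_map.items():
    logit_bias[token_id] = bias

client = InferenceClient(api_key="hf_xxx")
completion = client.chat_completion(
    model=model_id,
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=20,
    logit_bias=logit_bias,
)
print(completion.choices[0].message.content)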
