How do I get a model to return tensors for cloze completion?

Hey!

Ultimately I want a reasonably modern model to do cloze completion tasks. That is, I give it a sentence like:
“Alice and Bob went to the market to bu_” and ask it what goes in the blank (here, the highest-probability answer is “y”).

I’d like to know the probability (or logit) of all the tokens it’s considering. I was doing this with OpenAI, but they made an API change that makes it very expensive.
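
To make it concrete, here is a minimal sketch of what I'd do with the scores if I could get them. The logits below are made up for illustration (they are not real model output), and `top_candidates` is just a hypothetical helper name. It applies a softmax to turn logits into probabilities and ranks the candidates.

```python
import math

def top_candidates(logits, k=3):
    """Turn a {token: logit} map into the k most probable (token, prob) pairs."""
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits.values())
    exp = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}
    # Highest probability first.
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Made-up logits for the blank in "...to bu_"; NOT real model output.
example_logits = {"y": 5.0, "t": 1.0, "s": 0.5, "n": -1.0}
ranked = top_candidates(example_logits)
```

So what I need from the endpoint is the per-token scores that would go into something like `example_logits`.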

I’m using a 72B model (liberated-qwen1-5-72b-kbd) on a Hugging Face hosted endpoint. Previously, with OpenAI’s chat interface, I used system messages to explain what I wanted, but I suppose that’s not required here.

I’m doing this:

    import requests

    API_URL = "https://somestuff.us-east-1.aws.endpoints.huggingface.cloud"
    headers = {
        "Accept": "application/json",
        "Authorization": "Bearer hf_something",
        "Content-Type": "application/json",
    }

    def query(payload):
        # POST the JSON payload to the endpoint and return the decoded JSON reply
        response = requests.post(API_URL, headers=headers, json=payload)
        return response.json()

    output = query({
        "inputs": "Can you please let us know more details about yo",
        "parameters": {
            "temperature": 1,
            "max_new_tokens": 10,
            "do_sample": False,
            "return_tensors": True,
            # "prefix": "somestuff"
        }
    })

and I get this back: [{'generated_text': 'Can you please let us know more details about yoour issue? What is the issue you are facing'}]

It’s just a list with one dict inside. I don’t seem to be getting tensors back.
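
To double-check that I'm not overlooking a field, this is roughly how I inspected the response. It's only a sketch: `describe_response` is a hypothetical helper of mine, and `sample` is the response shown above, pasted in by hand.

```python
def describe_response(output):
    """Return the sorted keys of each dict in the endpoint's response list."""
    return [sorted(item.keys()) for item in output]

# The response I actually got back, copied from above.
sample = [{
    "generated_text": (
        "Can you please let us know more details about yoour issue? "
        "What is the issue you are facing"
    )
}]
keys = describe_response(sample)
```

The only key present is `generated_text`, so no scores or tensors are hiding anywhere in the reply.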

What am I doing wrong?