Hello, apologies for the newbie question.
I am using the following code to obtain outputs from BLOOM. I have set the "return_full_text" parameter to False, but I always get the full input text back along with the predicted completion in the generated output string, regardless of whether this is set to True or False. I can confirm that the "max_new_tokens" parameter is working, because the length of the output string varies appropriately when I change it. The other parameter I have in there, "return_full_text": False, has no effect on the output.
Do I have the syntax wrong? The code was drawn from this post:
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer <TOKEN>"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

while True:
    text_input = input("Insert your input: ")
    output = query({
        "inputs": text_input,
        "parameters": {"max_new_tokens": 64,
                       "return_full_text": False},
        "options": {"use_gpu": True, "use_cache": False}
    })
    print(output)
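In case it helps anyone with the same problem: while waiting for an answer on why the parameter is ignored, I've been stripping the echoed prompt client-side. This is a minimal sketch; it assumes the API returns a list of dicts with a "generated_text" key (which is the shape I see in the printed output above) and that the completion begins with a verbatim copy of the prompt.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated

def completion_only(prompt: str, output) -> str:
    """Extract just the new tokens from the API output.

    Assumes the response shape is [{"generated_text": "..."}], as observed
    when printing the output above.
    """
    return strip_prompt(prompt, output[0]["generated_text"])
```

Then `print(completion_only(text_input, output))` in the loop prints only the continuation instead of prompt plus continuation.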
Also, the "use_gpu": True option doesn't appear to work. I am on the paid $9-a-month plan, so it should be available to me, but results don't come back any faster with this option. I've timed some sample queries that take about 14-15 seconds to return, and I get more or less the same response time whether use_gpu is set to True or False. I'm not sure what is happening; possibilities include: (a) I have the syntax wrong, (b) it doesn't make much of a difference anyway (which I doubt), (c) the system thinks I'm not allowed to use the GPU, or something else entirely.
To troubleshoot, it would be nice if I could get back a debug-level response, but it appears the response body is limited to the generated text. Is there any way to induce the system to return more information, such as what it interpreted the input data as, what all the parameters were actually set to when it generated the response, and so on?
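One thing worth trying while waiting for an answer: the response body may only contain the generated text, but the raw requests.Response object also carries a status code and HTTP headers, and APIs often put diagnostic information in custom "x-" headers. This is a sketch, not a documented feature of the Inference API; whether any useful headers are present is something to check empirically.

```python
def diagnostic_headers(headers) -> dict:
    """Pick out custom (x-...) headers, which sometimes carry server-side
    diagnostics such as compute type or timing. Takes any dict-like mapping
    of header names to values."""
    return {k: v for k, v in headers.items() if k.lower().startswith("x-")}

# Usage with the query function above (sketch):
# response = requests.post(API_URL, headers=headers, json=payload)
# print(response.status_code)          # e.g. 200, or an error code
# print(diagnostic_headers(response.headers))
```

If the headers turn out to be empty, at least the status code distinguishes a rejected request (e.g. a permissions problem with use_gpu) from a silently ignored parameter.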
Again sorry for newbie questions, thanks for any help.