BLOOM parameter '"return_full_text": False' isn't being respected, and the "use_gpu" option doesn't appear to be working

Hello, apologies for the newbie question.

I am using the following code to obtain outputs from BLOOM. I have set the “return_full_text” option to False, but the generated output string always contains the full input text along with the predicted completion, whether the option is set to True or False. I can confirm that the “max_new_tokens” option is working, because the length of the output string varies appropriately when I change it. However, the other parameter, “return_full_text”: False, has no effect on the output.

Do I have the syntax wrong? The code was drawn from this post:

import requests

API_URL = ""
headers = {"Authorization": "Bearer <TOKEN>"}

def query(payload):
    # POST the JSON payload to the Inference API and return the parsed response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

while True:
    text_input = input("Insert your input: ")
    output = query({
        "inputs": text_input,
        "parameters": {"max_new_tokens": 64,
                       "return_full_text": False},
        "options": {"use_gpu": True, "use_cache": False}
    })
    print(output)

Also, the “use_gpu”: True option doesn’t appear to work. I am on the paid $9-a-month plan, so it should be available to me, but the results don’t come back any faster with this option. I’ve timed some sample queries that take about 14 to 15 seconds to come back, and I get more or less the same response time whether use_gpu is set to True or False. I’m not sure what is happening. Possibilities include: (a) I have the syntax wrong, (b) it doesn’t make much of a difference anyway (I doubt it), (c) the system thinks I’m not allowed to use the GPU, or something else entirely.
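For anyone who wants to repeat the timing comparison, here is a minimal sketch of one way to do it. The helper names time_call and compare are hypothetical (they are not part of any library), and the sketch assumes each request is wrapped in a zero-argument callable around the query function from the code above.

```python
import time
from statistics import mean

def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

def compare(fn_a, fn_b, repeats=3):
    """Return the mean duration in seconds of each of two zero-argument callables."""
    return mean(time_call(fn_a, repeats)), mean(time_call(fn_b, repeats))
```

Usage would look something like compare(lambda: query(payload_gpu), lambda: query(payload_cpu)), where payload_gpu and payload_cpu are hypothetical payloads that differ only in the use_gpu value. Keeping "use_cache": False in both matters here, since a cached answer would otherwise skew the comparison.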

To troubleshoot, it would be nice if I could get back a debug-level response, but it appears the response is limited to the generated text. Is there any way to induce the system to return more information, such as how it interpreted the input data, what all the parameters were set to when it generated the response, and so on?
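Short of a server-side debug mode, the HTTP layer of the response can at least be inspected on the client. The sketch below is only an illustration: describe_response is a hypothetical helper, and it assumes it is handed the response object returned by requests.post (before .json() is called), which exposes status_code, headers, and text. Whether the headers reveal anything about GPU use or parameter handling is not something this sketch can promise.

```python
def describe_response(response):
    """Collect the status code, headers, and raw body from a requests-style response."""
    return {
        "status_code": response.status_code,
        "headers": dict(response.headers),
        "body": response.text,
    }
```

To use it, query would need to return the response object itself rather than response.json(), so that the status and headers are still available for printing.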

Again, sorry for the newbie questions, and thanks for any help.


Hi @ahomosapiens,

I posted a similar question about the stopping criteria parameter (Stopping criteria BLOOM), but I haven’t received an answer yet. I would say that it is a problem with BLOOM, because other models work correctly with the same parameters. I haven’t tried the use_gpu parameter, but I think that your code is correct.

I created a new discussion in the BLOOM community regarding the stopping criteria parameter. If I get any response, I’ll let you know.
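In the meantime, a possible client-side workaround is to remove the prompt from the returned text yourself. This is only a sketch, not a fix for the underlying behaviour: strip_prompt and completions are hypothetical helper names, and the code assumes the API returns a list of dicts that each carry a "generated_text" key.

```python
def strip_prompt(prompt, generated_text):
    """Return only the completion when generated_text begins with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Prompt not echoed back verbatim: leave the text untouched
    return generated_text

def completions(prompt, output):
    """Apply strip_prompt to every item of an assumed [{"generated_text": ...}] list."""
    return [strip_prompt(prompt, item["generated_text"]) for item in output]
```

With the loop from the original post, this would be called as completions(text_input, output). If the model alters whitespace at the start of the echoed prompt, the startswith check fails and the full text is returned unchanged.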


Great, thanks for the note!

The BLOOM model is new, of course, so a few quirks are to be expected. I’m so happy this has come out! Exciting times, and if you have any follow-up thoughts I’d love to hear from you.


Any update on this? I have the same problem.