Accelerated Inference API not taking parameters?

Hello! I’m trying to generate text using a fine-tuned T5, and I’m running into some truncation issues.

From the docs, I can see that sending the max_new_tokens parameter should let me get longer answers, but the API always responds with the same length no matter what I do. I also tried varying the parameter names: the API does validate them and responds with a 400 for unknown parameters, but it keeps truncating the response even when I send what looks like a correct request.

If I use the model in transformers, I get longer responses.

Here is what I’m sending.

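const inference_endpoint = ""

// Request config handed to the HTTP client (the surrounding call is omitted here)
const request = {
    headers: {
      "Authorization": "Bearer " + process.env.HF_TOKEN,
      "Content-Type": "application/json"
    },
    url: inference_endpoint,
    method: "post",
    data: {
      inputs: query,
      // Generation settings
      parameters: {
        max_new_tokens: 196,
      },
      // API behaviour settings, a sibling of `parameters`
      options: {
        wait_for_model: await_for_model
      }
    }
}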

Am I doing anything wrong? Thanks in advance for your help!!
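
For anyone comparing against their own request, here is a minimal sketch of the same payload built for fetch instead of a config object. The helper name buildGenerationRequest and its arguments are hypothetical, not part of any library; the only assumption about the API is the layout already shown above, with generation settings under parameters and API settings under options.

```javascript
// Hypothetical helper: builds the second argument for fetch().
// Generation settings go under `parameters`; API behaviour goes under `options`.
function buildGenerationRequest(token, query, maxNewTokens, waitForModel) {
  return {
    method: "POST",
    headers: {
      "Authorization": "Bearer " + token,
      "Content-Type": "application/json",
    },
    // fetch() expects a string body, so the payload is serialized here
    body: JSON.stringify({
      inputs: query,
      parameters: { max_new_tokens: maxNewTokens },
      options: { wait_for_model: waitForModel },
    }),
  };
}

// Usage (the endpoint URL is left to the caller; needs Node 18+ or a browser):
// const res = await fetch(inferenceEndpoint,
//   buildGenerationRequest(process.env.HF_TOKEN, query, 196, true));
// console.log(await res.json());
```

Logging the parsed body before sending makes it easy to confirm that max_new_tokens really ends up nested under parameters and not at the top level.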


Same problem
How can one define the maximum output length in the Inference API?

Hi @juancavallotti,
Did you try ‘max_length’ instead of ‘max_new_tokens’?

Yup, the parameter didn’t get rejected, but it didn’t work either.

Any update or solution on this issue? I’m encountering the same problem with the BLOOM model: none of the advanced parameters of the Inference API seem to work.

Same problem for me and I didn’t receive any reply in my thread. It seems like the Inference API for BLOOM is basically broken and only allows basic generation.

Personally, I’ve found that the following parameters are ignored:

  • max_new_tokens
  • temperature
  • do_sample
  • use_gpu (but this is to be expected, afaik HF handles GPU on the Inference API under separate pricing)

do_sample is particularly frustrating as BLOOM generates the same output over and over for a given input. I managed to force a bit of variation by playing with top_k, but this is not very rigorous.
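
One thing that may be worth ruling out when the output never changes: the Inference API documentation lists a use_cache option (on by default) that returns a cached result for a request it has already seen. Below is a minimal sketch of a sampling payload with the cache turned off. The helper name buildSamplingPayload and the default values are illustrative assumptions, and nothing in this thread confirms that this fixes the BLOOM behaviour.

```javascript
// Hypothetical helper: payload asking for sampled (non-greedy) generation.
// The default values are illustrative, not recommendations.
function buildSamplingPayload(query, { temperature = 0.9, topK = 50, maxNewTokens = 64 } = {}) {
  return {
    inputs: query,
    parameters: {
      do_sample: true,          // sample instead of greedy decoding
      temperature: temperature, // higher values flatten the distribution
      top_k: topK,              // restrict sampling to the k most likely tokens
      max_new_tokens: maxNewTokens,
    },
    options: {
      use_cache: false,         // assumption: bypass cached results for identical inputs
      wait_for_model: true,     // wait for the model to load instead of failing early
    },
  };
}

// Usage: send JSON.stringify(buildSamplingPayload("Once upon a time")) as the POST body.
```

If two identical requests still return the same text with the cache disabled, that would point at the sampling parameters being dropped rather than at caching.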