Streaming partial results from hosted text-generation APIs?

Is it possible to call the hosted text-generation APIs in such a way as to get low-latency partial streaming results, without having to wait for the full completion to be returned as JSON?

OpenAI has a stream parameter, documented here:

And InferKit has a streamResponse parameter, documented here:

https://inferkit.com/docs/api/generation

But I can’t find anything similar in the Huggingface API docs:

Based on the lack of response, I’m assuming this isn’t currently possible with the hosted Huggingface APIs. Is this the kind of thing that might easily be implemented if I file a feature-request ticket on the GitHub project?

Hey @benjismith. Sorry for the slow response. The API has support for streaming as documented here: Parallelism and batch jobs — Api inference documentation. It has a small example that we hope helps you.

cc @Narsil

Hi @benjismith,

For long generation, we currently don’t have a chunking option like the one InferKit seems to offer.
What we do have is a max_time parameter to limit the duration of the in-flight request (latency seems to depend on actual usage and on the user; if you’re doing live suggestions, time to the first suggestion is really important).
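For a live-suggestion use case, a single request capped with max_time could look roughly like this (gpt2 and the HF_API_TOKEN environment variable are just placeholders):

import os

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

# Cap the request at ~1 second so the first suggestion comes back quickly,
# even if that means fewer tokens are generated in this call.
payload = {
    "inputs": "The answer to the universe is",
    "parameters": {"max_time": 1, "return_full_text": False},
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()[0]["generated_text"])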

The streaming mentioned by @osanseviero is another option, but it is a better fit for “batch-like” jobs, where you want to keep a GPU fed and you are not as latency-sensitive (so you can send all your work at once and just wait for the answers).
If I understand correctly, long generation is not really that case, since every new chunk depends on the previously generated tokens. Still, it’s definitely a viable option.
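For completeness, a rough client-side sketch of that batch-style usage could be to fire all the requests at once and handle each answer as it comes back; the exact mechanism in the linked docs may differ, so treat this only as an illustration of the idea:

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

prompts = [
    "The answer to the universe is",
    "Once upon a time",
    "The best way to feed a GPU is",
]

def generate(prompt):
    # One ordinary Inference API call per prompt.
    payload = {"inputs": prompt, "parameters": {"max_time": 1, "return_full_text": False}}
    response = requests.post(API_URL, headers=headers, json=payload)
    return prompt, response.json()[0]["generated_text"]

# Send all the work at once and process each answer as soon as it completes.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    futures = [pool.submit(generate, p) for p in prompts]
    for future in as_completed(futures):
        prompt, generated = future.result()
        print(f"{prompt!r} -> {generated!r}")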

Here is an example of how the max_time-based chunking would work in Python:

import json
import os

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"


def streaming(payload):
    # Repeatedly call the Inference API. Each request is capped by max_time, so it
    # returns a short chunk of new text quickly; the chunk is appended to the prompt
    # and sent back, so generation picks up where it left off.
    for _ in range(10):
        data = json.dumps(payload)
        response = requests.post(API_URL, headers=headers, data=data)
        return_data = json.loads(response.content.decode("utf-8"))
        # With return_full_text=False, generated_text contains only the new tokens.
        new_text = return_data[0]["generated_text"]
        payload["inputs"] += new_text
        print("---" * 20)
        print(payload["inputs"])
        print("---" * 20)


streaming(
    {
        "inputs": "The answer to the universe is",
        "parameters": {"max_time": 1, "return_full_text": False},
    }
)

Would something like that be viable?

Thanks for the responses, @Narsil and @osanseviero!

I’m integrating the HF text-generation API into a word-processor, and it would be a nice user experience if the tokens could be streamed directly into the client.

Right now, if I’m generating 100 tokens, it might take 10 seconds to complete the task. And the user has to sit and wait the full duration before they can see any results at all. But it probably takes less than 1 second to generate the first token, and each subsequent token probably takes only a few milliseconds to generate… There’s no need to wait for the last token to arrive before showing the first one, so it would be great if the tokens could stream.

Here’s how OpenAI does it…

[animated image: streaming-tokens]

I understand the underlying model class has a generate method that returns one or more output sequences. But how difficult would it be to add a generateAndStream method that takes an output stream as an argument and writes tokens into that stream?
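To make that concrete, here is a rough local sketch with transformers (not the hosted API): a naive greedy loop that yields one token at a time. The generate_and_stream name is hypothetical, and the loop re-runs the full forward pass each step rather than using the KV cache, so it only shows the shape of the interface:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate_and_stream(prompt, max_new_tokens=20):
    # Hypothetical streaming variant of generate(): yields each new token's text
    # as soon as it is picked, instead of returning the whole sequence at the end.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits
        next_token = logits[0, -1].argmax().reshape(1, 1)  # greedy choice
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        yield tokenizer.decode(next_token[0])

for piece in generate_and_stream("The answer to the universe is"):
    print(piece, end="", flush=True)
print()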

At any rate, if I wanted to file an enhancement request in a GitHub ticket for something like this, should I create it in the transformers project or the huggingface_hub project?