Streaming partial results from hosted text-generation APIs?

Is it possible to call the hosted text-generation APIs in such a way as to get low-latency partial streaming results, without having to wait for the full completion to be returned as JSON?

OpenAI has a stream parameter, documented here:

And InferKit has a streamResponse parameter, documented here:

https://inferkit.com/docs/api/generation
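
For reference, here's roughly what I mean, using OpenAI's stream parameter (untested sketch; the engine name is just an example):

import os

import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# stream=True yields partial completion events as they are generated,
# instead of one JSON blob at the end.
for event in openai.Completion.create(
    engine="davinci",
    prompt="The answer to the universe is",
    max_tokens=32,
    stream=True,
):
    print(event.choices[0].text, end="", flush=True)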

But I can’t find anything similar in the Hugging Face API docs:

Based on the lack of response, I’m assuming this isn’t currently possible with the hosted Hugging Face APIs. Is this the kind of thing that might be easily implemented if I file a feature-request ticket on the GitHub project?

Hey @benjismith. Sorry for the slow response. The API has support for streaming as documented here: Parallelism and batch jobs — Api inference documentation. It has a small example that we hope helps you.
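
In short, you open a connection, send your payloads, and read the results back as they complete. A rough, untested sketch (the endpoint URL and message format below are placeholders; the real ones are on the linked page):

import asyncio
import json
import os

import websockets

API_TOKEN = os.getenv("HF_API_TOKEN")
# Placeholder URL; the actual streaming endpoint is given in the linked docs.
WS_URL = "wss://api-inference.huggingface.co/..."


async def run(prompts):
    async with websockets.connect(
        WS_URL, extra_headers={"Authorization": f"Bearer {API_TOKEN}"}
    ) as ws:
        # Queue up all inputs at once, then read results as they finish.
        for prompt in prompts:
            await ws.send(json.dumps({"inputs": prompt}))
        for _ in prompts:
            print(json.loads(await ws.recv()))


asyncio.run(run(["The answer to the universe is", "Once upon a time"]))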

cc @Narsil

Hi @benjismith,

For long generation, we currently don’t have a chunking option like the one InferKit seems to offer.
What we do have is a max_time parameter to limit how long the in-flight request runs (latency depends on actual usage and on the user; if you’re doing live suggestions, the time to the first suggestion is what really matters).

The streaming mentioned by @osanseviero is another option, but it’s better suited to “batch-like” jobs, where you want to keep a GPU fed and you’re not as latency-sensitive (you can send all your work at once and just wait for the answers).
If I understand correctly, that wouldn’t apply to long generation, since each new request depends on the previously generated tokens. Still, it’s definitely a viable option.

Here is an example of how that would work in Python:

import json
import os

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"


def streaming(payload):
    # Repeatedly request a short continuation and feed it back in as the new
    # prompt, so the text grows chunk by chunk instead of in one long request.
    for i in range(10):
        data = json.dumps(payload)
        response = requests.post(API_URL, headers=headers, data=data)
        return_data = json.loads(response.content.decode("utf-8"))
        # With return_full_text=False the API returns only the newly generated
        # text, which we append to the prompt for the next iteration.
        generated = return_data[0]["generated_text"]
        payload["inputs"] += generated
        print("---" * 20)
        print(payload["inputs"])
        print("---" * 20)


streaming(
    {
        "inputs": "The answer to the universe is",
        "parameters": {"max_time": 1, "return_full_text": False},
    }
)

Would something like that be viable?

At any rate, if I wanted to file an enhancement request in a GitHub ticket for something like this, should I create it in the transformers project or the huggingface_hub project?

@benjismith, did you ever file the request?

Hey guys! I’ve made a streaming generation service for Hugging Face transformers that is fully compatible with the OpenAI API: https://github.com/hyperonym/basaran

It also comes with a fancy playground. :wink:
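
If you already use the OpenAI Python client, you can point it at a locally running Basaran server and stream the same way. A quick sketch (the port and model name are just examples; see the README for how to start the server):

import openai

# Point the regular OpenAI client at a local Basaran server
# (the port and model name here are just examples).
openai.api_base = "http://127.0.0.1/v1"
openai.api_key = "unused"

for chunk in openai.Completion.create(
    model="bigscience/bloomz-560m",
    prompt="The answer to the universe is",
    max_tokens=32,
    stream=True,
):
    print(chunk.choices[0].text, end="", flush=True)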
