Is it possible to call the hosted text-generation APIs in such a way as to get low-latency partial streaming results, without having to wait for the full completion to be returned as JSON?
OpenAI has a stream parameter, documented here:
And InferKit has a streamResponse parameter, documented here:
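For reference, with the openai Python package the streaming behaviour looks roughly like this (a minimal sketch; the model name is just an example and the exact fields may differ between API versions):

import openai

openai.api_key = "sk-..."  # your OpenAI key

# stream=True returns an iterator of partial completions instead of waiting
# for the whole completion to come back as a single JSON response.
for chunk in openai.Completion.create(
    model="text-davinci-003",  # example model
    prompt="The answer to the universe is",
    max_tokens=50,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)

That is the kind of low-latency, token-by-token behaviour I'm asking about.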
Based on the lack of response, I'm assuming this isn't currently possible with the hosted Hugging Face APIs. Is this the kind of thing that might easily be implemented if I file a feature-request ticket on the GitHub project?
For long generation, we currently don’t have a chunking option like InferKit seems to propose.
What we do have is a max_time parameter to limit how long the in-flight request runs (latency seems to depend on actual usage and user; if you're doing live suggestions, time to the first suggestion is really important).
The streaming mentioned by @osanseviero is another option, but it's better suited to "batch-like" jobs, where you want to keep a GPU fed and you are not as latency-sensitive (so you can send all your work at once and just wait for the answers).
If I understand correctly, for long generation this would not be the case, since every new generation depends on the previously generated tokens. Still, it's definitely a viable option.
Here is an example of how that would work in Python:
import json
import os

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"


def streaming(payload):
    # Repeatedly call the API, each time appending the newly generated text
    # to the prompt, so a long generation is produced in short chunks.
    for i in range(10):
        data = json.dumps(payload)
        response = requests.request("POST", API_URL, headers=headers, data=data)
        return_data = json.loads(response.content.decode("utf-8"))
        # With return_full_text=False, generated_text contains only the new tokens.
        generated = return_data[0]["generated_text"]
        payload["inputs"] += generated
        print("---" * 20)
        print(payload["inputs"])
        print("---" * 20)


streaming(
    {
        "inputs": "The answer to the universe is",
        "parameters": {"max_time": 1, "return_full_text": False},
    }
)
At any rate, if I wanted to file an enhancement request in a GitHub ticket for something like this, should I create it in the transformers project or the huggingface_hub project?
Hey guys! I’ve made a streaming generation service for Hugging Face transformers that is fully compatible with the OpenAI API: https://github.com/hyperonym/basaran
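For example, since the API surface matches OpenAI's, you can stream tokens from a running Basaran instance with plain requests (a rough sketch; the host, port, and model name below are placeholders for whatever your deployment actually serves):

import json

import requests

# Assumes a Basaran instance running locally; adjust host/port for your setup.
API_URL = "http://127.0.0.1:80/v1/completions"

payload = {
    "model": "gpt2",  # whichever model the instance is serving
    "prompt": "The answer to the universe is",
    "max_tokens": 50,
    "stream": True,
}

with requests.post(API_URL, json=payload, stream=True) as response:
    # The response is a server-sent event stream in the OpenAI format:
    # each event is a "data: {...}" line, terminated by "data: [DONE]".
    for line in response.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["text"], end="", flush=True)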