Streaming partial results from hosted text-generation APIs?

Is it possible to call the hosted text-generation APIs in such a way as to get low-latency partial streaming results, without having to wait for the full completion to be returned as JSON?

OpenAI has a stream parameter, documented here:

And InferKit has a streamResponse parameter, documented here:

https://inferkit.com/docs/api/generation

But I can’t find anything similar in the Huggingface API docs:

Based on the lack of response, I’m assuming this isn’t currently possible with the hosted Huggingface APIs. Is this the kind of thing that might easily be implemented if I file a feature-request ticket on the GitHub project?

Hey @benjismith. Sorry for the slow response. The API has support for streaming as documented here: Parallelism and batch jobs — Api inference documentation. It has a small example that we hope helps you.

cc @Narsil

Hi @benjismith,

For long generation, we currently don’t have a chunking option like the one InferKit seems to offer.
What we do have is a max_time parameter to limit the duration of the in-flight request (latency seems to depend on actual usage and on the user; if you’re doing live suggestions, time to the first suggestion is really important).
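For a live-suggestion use case, a single request capped with max_time could look roughly like this (gpt2 and the HF_API_TOKEN environment variable are just placeholders):

import os

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

# Cap the request at ~1 second so the first suggestion comes back quickly,
# even if that means fewer tokens are generated in this call.
payload = {
    "inputs": "The answer to the universe is",
    "parameters": {"max_time": 1, "return_full_text": False},
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()[0]["generated_text"])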

The streaming mentioned by @osanseviero is another option, but it is a better fit for “batch-like” jobs, where you want to keep a GPU fed and you are not as latency-sensitive (so you can send all your work at once and just wait for the answers).
If I understand correctly, long generation is not really that case, since every new chunk depends on the previously generated tokens. Still, it’s definitely a viable option.
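For completeness, a rough client-side sketch of that batch-style usage could be to fire all the requests at once and handle each answer as it comes back; the exact mechanism in the linked docs may differ, so treat this only as an illustration of the idea:

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

prompts = [
    "The answer to the universe is",
    "Once upon a time",
    "The best way to feed a GPU is",
]

def generate(prompt):
    # One ordinary Inference API call per prompt.
    payload = {"inputs": prompt, "parameters": {"max_time": 1, "return_full_text": False}}
    response = requests.post(API_URL, headers=headers, json=payload)
    return prompt, response.json()[0]["generated_text"]

# Send all the work at once and process each answer as soon as it completes.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    futures = [pool.submit(generate, p) for p in prompts]
    for future in as_completed(futures):
        prompt, generated = future.result()
        print(f"{prompt!r} -> {generated!r}")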

Here is an example of how the max_time-based chunking would work in Python:

import json
import os

import requests

API_TOKEN = os.getenv("HF_API_TOKEN")
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"


def streaming(payload):
    # Repeatedly call the Inference API. Each request is capped by max_time, so it
    # returns a short chunk of new text quickly; the chunk is appended to the prompt
    # and sent back, so generation picks up where it left off.
    for _ in range(10):
        data = json.dumps(payload)
        response = requests.post(API_URL, headers=headers, data=data)
        return_data = json.loads(response.content.decode("utf-8"))
        # With return_full_text=False, generated_text contains only the new tokens.
        new_text = return_data[0]["generated_text"]
        payload["inputs"] += new_text
        print("---" * 20)
        print(payload["inputs"])
        print("---" * 20)


streaming(
    {
        "inputs": "The answer to the universe is",
        "parameters": {"max_time": 1, "return_full_text": False},
    }
)

Would something like that be viable?

Thanks for the responses, @Narsil and @osanseviero!

I’m integrating the HF text-generation API into a word-processor, and it would be a nice user experience if the tokens could be streamed directly into the client.

Right now, if I’m generating 100 tokens, it might take 10 seconds to complete the task. And the user has to sit and wait the full duration before they can see any results at all. But it probably takes less than 1 second to generate the first token, and each subsequent token probably takes only a few milliseconds to generate… There’s no need to wait for the last token to arrive before showing the first one, so it would be great if the tokens could stream.

Here’s how OpenAI does it…

[animated image: streaming-tokens]

I understand the underlying model class has a generate method that returns one or more output sequences. But how difficult would it be to add a generateAndStream method that takes an output stream as an argument and writes tokens into that stream?
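To make that concrete, here is a rough local sketch with transformers (not the hosted API): a naive greedy loop that yields one token at a time. The generate_and_stream name is hypothetical, and the loop re-runs the full forward pass each step rather than using the KV cache, so it only shows the shape of the interface:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate_and_stream(prompt, max_new_tokens=20):
    # Hypothetical streaming variant of generate(): yields each new token's text
    # as soon as it is picked, instead of returning the whole sequence at the end.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits
        next_token = logits[0, -1].argmax().reshape(1, 1)  # greedy choice
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        yield tokenizer.decode(next_token[0])

for piece in generate_and_stream("The answer to the universe is"):
    print(piece, end="", flush=True)
print()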

At any rate, if I wanted to file an enhancement request in a GitHub ticket for something like this, should I create it in the transformers project or the huggingface_hub project?