Is it possible to have streaming responses from inference endpoints?

I assume the answer is no. Inference Endpoints seem to support only short-lived request/response connections (similar to AWS Lambdas).
I understand it’s possible in Hugging Face Spaces, for example via server-sent events (SSE), but I would prefer to have the scaling capabilities of Inference Endpoints for my application.

Is it possible?

For streaming responses, see: Deploy LLMs with Hugging Face Inference Endpoints

If you mean protocols other than HTTP, then the answer is no.
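For reference, token streaming with a text-generation-inference backed endpoint happens over a single HTTP response and can be consumed with the huggingface_hub client. A minimal sketch (the endpoint URL below is a placeholder):

```python
# Minimal sketch: consume token streaming over plain HTTP from a TGI-backed
# Inference Endpoint. The endpoint URL is a placeholder.
from huggingface_hub import InferenceClient

client = InferenceClient(model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud")

# With stream=True, tokens are yielded as they are generated, all within one
# HTTP response.
for token in client.text_generation(
    "Explain server-sent events in one sentence.",
    max_new_tokens=100,
    stream=True,
):
    print(token, end="", flush=True)
```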

Hi Phil. Thanks for the response!

My use case uses a custom handler endpoint (handler.py). Is there a way I can set that up to work? For example, I could set up an SSE streaming response within the handler.py.

Any advice appreciated!

Edit: For full context, my application uses LlamaIndex with a remote call to OpenAI’s GPT-4. GPT-4 can produce streaming output, and I would like to forward the generator output to the client using SSE.
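For illustration, the generator I want to forward looks roughly like this (a sketch; the index construction and data path are placeholders):

```python
# Rough sketch of the streaming generator side (LlamaIndex + OpenAI); the index
# construction and data path are placeholders.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./data").load_data(),
    service_context=service_context,
)

# streaming=True makes query() return a response whose response_gen yields
# tokens as they arrive from the model; this is the generator I would like to
# forward to the client via SSE.
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Summarise the documents.")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
```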

Custom handlers only support traditional HTTP request ↔ response. To add this, you would need to create a custom container yourself that implements the feature.
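A minimal sketch of such a custom container, assuming FastAPI and the OpenAI Python client; the /generate route and request schema are illustrative assumptions, not a fixed contract:

```python
# Hypothetical sketch of a custom container that exposes an SSE endpoint and
# forwards OpenAI's streamed chat completion chunks. The /generate route and
# request schema are illustrative assumptions.
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


class GenerateRequest(BaseModel):
    prompt: str


@app.post("/generate")
def generate(req: GenerateRequest):
    def event_stream():
        # Request a streamed chat completion and forward each content delta as
        # an SSE "data:" event.
        stream = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield f"data: {delta}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Building this into an image, serving it with uvicorn, and pointing the endpoint’s custom container configuration at it is left out here.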

Is there a tutorial for this? I haven’t been able to get a custom container to work. It always ends up stuck at initializing, with the logs giving no insights. Does the model repository still need to have a handler.py file?

Hey Phil,

It ended up being an issue on my end with setting the right AWS credentials.
Thank you very much for your help.

Volker