I assume the answer is no. The inference endpoints seem like short-lived connection endpoints only (similar to AWS Lambdas).
I understand it’s possible in Hugging Face Spaces, for example via server-sent events, but I would prefer to have the scaling capabilities of Inference Endpoints for my application.
Is it possible?
If you mean other protocols than HTTP, then the answer is no. For streaming responses, see: Deploy LLMs with Hugging Face Inference Endpoints
Hi Phil. Thanks for the response!
My use case is a custom handler endpoint (handler.py). Is there a way I can set that up so streaming works? I could, for example, set up an SSE streaming response within the handler.py.
Any advice appreciated!
Edit: For full context, my application uses LlamaIndex with a remote call to OpenAI’s GPT-4. GPT-4 can produce streaming output, and I would like to forward the generator output to the client using SSE.
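To illustrate what I mean by "forwarding the generator output", here is a minimal sketch of wrapping a streaming token generator in the SSE wire format. The `fake_llm_stream` generator is a hypothetical stand-in for the chunks yielded by the OpenAI streaming API (`stream=True`); names here are assumptions, not the actual handler API.

```python
def fake_llm_stream():
    # Stand-in for a streaming LLM response (e.g. OpenAI with stream=True),
    # which yields text chunks one at a time.
    for token in ["Hello", ",", " world", "!"]:
        yield token

def sse_events(token_iter):
    """Wrap each token in the SSE wire format: 'data: <payload>\\n\\n'."""
    for token in token_iter:
        yield f"data: {token}\n\n"
    # Sentinel event so the client knows the stream has ended.
    yield "data: [DONE]\n\n"

events = list(sse_events(fake_llm_stream()))
```

A browser client could then consume these events with `EventSource`, closing the connection when it sees the `[DONE]` sentinel.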
Custom handlers only support the traditional HTTP request ↔ response pattern. To add streaming you would need to create a custom container yourself that implements the feature.
Is there a tutorial for this? I haven’t been able to get a custom container to work. It always ends up stuck at initializing, with the logs giving no insight. Does the model repository still need to have a handler.py file?
It ended up being an issue on my end with setting the right AWS credentials.
Thank you very much for your help.