Who’s running open-source LLMs in enterprise production, and how?

Hi everyone,

I wasn’t sure which category this best fits in, so I’m posting here since it’s about production deployment.

I’m trying to understand how enterprise teams are deploying open-source LLMs in real production environments.

If you’re running models internally or on your own infrastructure, I’d love to hear about your setup:

  • How you’re serving the model

  • The hardware or cloud configuration you’ve found viable

  • Key challenges you’ve hit (throughput, latency, cost, monitoring, compliance)

  • And what finally made your setup feel production-ready

I’m especially interested in enterprise use cases that have actually gone live, versus those that stayed at the prototype stage.

Feel free to share deep technical details or architecture notes if you can.

Thanks for taking the time to share your experience.


Hi @iknowjerome :waving_hand:

That’s a super interesting conversation opener and something I’ve worked with a lot, both at Hugging Face and before. So although I’m an HF employee, I think I can give some useful insights on the matter; I’ll try to be balanced :grinning_face_with_smiling_eyes:

From a technical & model perspective the options to do production deployments of LLMs has improved a ton in the last 2 years (at least). I’d say that using vLLM is the best option where possible. You’ll get continuous batching, chunked prefill, sota kernels and a lot more out of the box. That being said: if you’re model isn’t supported by vLLM, you’re facing a bigger challenge to implement the server engine yourself in most cases.

On the hardware side, GPU utilization is still super hard. Provisioning instances quickly enough for fast auto-scaling isn’t really there yet, and just running a model efficiently on several nodes is tricky. Even the big labs have outages constantly because of this. My recommended reading on this topic is from the vLLM Paris meetup a few months ago; it has some good technical explanations.
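As a rough illustration of the multi-GPU part (the model name and numbers below are just placeholders): within a single node, vLLM lets you shard the model with tensor parallelism, and that’s usually where I’d start before going multi-node:

```python
from vllm import LLM

# Hypothetical single-node setup: shard a larger model across 4 GPUs.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=4,        # split the weights across 4 GPUs on this node
    gpu_memory_utilization=0.90,   # fraction of each GPU's memory vLLM may claim
)
```

Going beyond one node (pipeline parallelism over a cluster, autoscaling replicas behind a load balancer) is where most of the operational pain comes from.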

And lastly, as a small pitch: we of course offer Inference Endpoints as a managed service to help companies get AI models to production with as little hassle as possible. There are two case study blogs on it that you can read as well.

Hopefully this opens up the current state a little bit :grinning_face_with_smiling_eyes: :raising_hands:


@erikkaum This is incredibly helpful. Thank you.
