What’s a common reference architecture for companies that use Sentence Transformers via Hugging Face in production?
I was thinking:
API gateway → queue → serverless function (Sentence Transformers model)
Is it best to co-locate the model file with my Lambda (e.g. in the same VPC)? Looking for any and all best practices.
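Here’s roughly the handler I had in mind for the serverless piece, in case it helps clarify. Just an untested sketch on my end; the SQS payload shape, the `MODEL_DIR` path, and where the results end up are all assumptions, not decisions I’ve made yet:

```python
# Rough sketch of the serverless worker (untested).
# Assumes the model files are bundled with the function (or on an attached
# EFS mount) at MODEL_DIR, and that the queue is SQS with JSON bodies like
# {"id": "...", "texts": ["..."]}. All names here are placeholders.
import json
import os

from sentence_transformers import SentenceTransformer

MODEL_DIR = os.environ.get("MODEL_DIR", "/opt/model")  # bundled/EFS path (assumption)

# Load once per container so warm invocations skip the expensive part.
model = SentenceTransformer(MODEL_DIR)


def handler(event, context):
    """SQS-triggered entry point: embed each record's texts."""
    results = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        embeddings = model.encode(body["texts"])  # numpy array, one row per text
        results.append({
            "id": body.get("id"),
            "embeddings": [vector.tolist() for vector in embeddings],
        })
    # In practice these would be written somewhere (S3, a vector DB, another
    # queue) rather than returned, but that keeps the sketch self-contained.
    return {"processed": len(results)}
```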
Hi there! I think the things it depends on most are:
- Your company’s existing stack
- Your use-case (expected load, real-time vs. batched, etc.)
Can you share a bit more about what those look like in your situation? Without that, it’s hard to give concrete recommendations.

For example, I could point you towards our Inference API service (Inference API - Hugging Face), which lets you offload inference to our infrastructure. Or you could take an approach like the one outlined in How to Deploy NLP Models in Production - neptune.ai. Some companies set up full CI/CD pipelines when they need to continuously monitor, retrain, and redeploy their models (Continuous Delivery for Machine Learning).
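As one concrete illustration of the Inference API route, computing sentence embeddings can be as small as the snippet below. This is only a sketch: the model ID is an example, and you’d supply your own API token.

```python
# Minimal sketch of offloading embedding to the hosted Inference API.
# The model ID below is illustrative; swap in whichever Sentence Transformers
# model you actually use, and set HF_API_TOKEN in your environment.
import os

import requests

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # example model (assumption)
API_URL = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{MODEL_ID}"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}


def embed(texts):
    """Return one embedding (list of floats) per input text."""
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": texts})
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    vectors = embed(["hello world", "sentence transformers in production"])
    print(len(vectors), "embeddings of dimension", len(vectors[0]))
```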
If you can share more details about your use-case, I can definitely try to be more specific!