Productionizing HuggingFace Transformers?

what’s a common reference architecture for companies that use sentence transformers via huggingface in production?

i was thinking:

api gateway → queue → serverless (sentence transformer module)

is it best to co-locate the model file in my lambda VPC? looking for any and all best practices.
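to make the serverless step concrete, here's a rough sketch of what i have in mind. it assumes an SQS-style event (a "Records" list with "messageId" and "body"), a JSON message body of the form {"text": "..."}, and a MODEL_PATH env var pointing at wherever the model files end up. those names and the message format are my own placeholders, not an established pattern.

```python
import json
import os

# Hypothetical location of the model files (e.g. baked into the container
# image or on a mounted volume); an assumption for illustration only.
MODEL_PATH = os.environ.get("MODEL_PATH", "/opt/model")

_model = None  # cached so warm invocations skip the load


def get_model():
    """Load the model once per container and reuse it afterwards."""
    global _model
    if _model is None:
        # Imported lazily so the heavy import only happens on first use.
        from sentence_transformers import SentenceTransformer
        _model = SentenceTransformer(MODEL_PATH)
    return _model


def embed(sentences):
    """Encode a list of strings into a list of float vectors."""
    return get_model().encode(sentences).tolist()


def handler(event, context, embed_fn=embed):
    """Entry point for a queue-triggered function (SQS-style event assumed)."""
    records = event.get("Records", [])
    # Each queue message body is assumed to be JSON like {"text": "..."}.
    texts = [json.loads(r["body"])["text"] for r in records]
    # One batched call to the model for all messages in this invocation.
    vectors = embed_fn(texts) if texts else []
    # Pair each embedding back with the message it came from.
    return [
        {"messageId": r["messageId"], "embedding": v}
        for r, v in zip(records, vectors)
    ]
```

the embed_fn parameter is only there so the handler can be exercised locally with a stub instead of the real model. keeping the model in a module-level variable is what lets warm invocations reuse it.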

Hi there! I think it depends mostly on:

  1. Your company’s existing stack
  2. Your use-case (expected load, real-time vs. batched, etc.)

Can you share a bit more about what those look like in your situation? Without that, it’s (IMO) a bit difficult to give any recommendations. I could point you towards our Inference API service, for example (Inference API - Hugging Face), which lets you offload inference to our infrastructure. Or you could take an approach like the one outlined in “How to Deploy NLP Models in Production”. Some companies set up entire CI/CD pipelines if they need to constantly monitor, retrain, and redeploy their models (see “Continuous Delivery for Machine Learning”).

If you can share more about your use-case, I can definitely try to give more specific suggestions!