What is the best way to serve a Hugging Face model with an API?

Should I use TF or PyTorch?

For PyTorch, should I use TorchServe, or Pipelines behind something like Flask?


You have a few different options; here are some in increasing order of difficulty:

  1. You can use the Hugging Face Inference API via the Model Hub if you are just looking for a demo.
  2. You can use a hosted model deployment platform: GCP AI Platform predictions, SageMaker, https://modelzoo.dev/. Full disclosure: I am the developer behind Model Zoo, and I’m happy to give you some credits for experimentation.
  3. You can roll your own model server with something like https://fastapi.tiangolo.com/ and deploy it on a generic serving platform like AWS Elastic Beanstalk or Heroku. This is the most flexible option.

@yoavz Hey, I am also looking for an answer to this. Can you share more references or a tutorial? Thank you!


Sure – here are more links for each path:

  1. Hugging Face Model Hub: https://huggingface.co/transformers/model_sharing.html
  2. Model Zoo: https://docs.modelzoo.dev/quickstart/transformers.html
  3. Roll your own deployment stack: https://github.com/curiousily/Deploy-BERT-for-Sentiment-Analysis-with-FastAPI

Interested in model serving too. I don’t think that FastAPI stack is what we want – it’s good for a quickstart, but it’s preferable to have FastAPI serve your web API and a job queue (e.g. RabbitMQ) for submitting expensive GPU jobs. I currently have an EC2 instance I spin up on demand from the FastAPI server: submit the job, receive the results, send them to the client. Alternatively, you can use AWS Batch with a Dockerfile for your transformers models. But in the spin-up case, scaling is a real pain; and in the Batch case, there’s huge overhead in the Batch job coming online just for inference.
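The web-API / job-queue split described above can be sketched with the standard library, using an in-process `queue.Queue` as a stand-in for a real broker like RabbitMQ. The `run_inference` function is a hypothetical placeholder for the expensive GPU call.

```python
# Sketch of the API-server / worker split: the API thread enqueues jobs,
# a worker thread (standing in for a GPU box) consumes and runs them.
# queue.Queue is a stand-in for a real broker such as RabbitMQ.
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}


def run_inference(text):
    # Hypothetical placeholder for the expensive GPU call
    # (e.g. a transformers pipeline on the EC2 instance).
    return {"label": "POSITIVE", "input": text}


def worker():
    while True:
        job_id, text = jobs.get()
        if job_id is None:  # shutdown sentinel
            break
        results[job_id] = run_inference(text)
        jobs.task_done()


def submit(text):
    # What the FastAPI endpoint would do: enqueue and return a job id
    # the client can poll for results.
    job_id = str(uuid.uuid4())
    jobs.put((job_id, text))
    return job_id


threading.Thread(target=worker, daemon=True).start()
```

In a real deployment the queue would be durable and the worker a separate process or machine, but the submit/poll shape stays the same.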

What we really want is proper cloud hosting of models, e.g. via GCP AI Platform. I’m not sure if Model Zoo serves this purpose; I’ll check it out ASAP. I do see https://github.com/maxzzze/transformers-ai-platform/tree/master/models/classification, but its last commit is Feb 10, and my quick scan of the repo makes me think it might be a bit rigid and will take a fair bit of tinkering for flexible use cases.

What would really be handy is a tutorial on deploying transformers models to GCP AI Platform: how to prepare and upload models; how to separate the surrounding code (model prep, tokenization prep, etc.); how to deal with their 500 MB model quota; all that stuff. Ideally there’d be a fairly first-class Hugging Face exporter, or an on-site tutorial.

Actually, this could be a business proposition for Hugging Face: host your models and charge for API calls! We’d develop locally to get things sorted, then switch to the API so we don’t have to worry about instance scaling and the like. Anyway, I’ll check out Model Zoo in case that’s what it does.


I have shared an example using TorchServe (for the NER use case), but it can be extended to other task types by using different pipelines.
blog post and repo
It includes a demo UI too!
(I can’t include more links because I’m a new user on this forum… just refer to the blog post.)
Hope it helps!


Is there a way to serve the Hugging Face BERT model with TF Serving such that TF Serving handles the tokenization along with inference? Is there any related documentation or blog post?

@jplu might help with this


Hi @anubhavmaity!

Thanks for your question! Unfortunately, it is currently not possible to integrate the tokenization process into a saved model alongside inference. Nevertheless, making this available is part of our plans, and we are currently rethinking the way saved models are handled in transformers :slight_smile:
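Since tokenization cannot live inside the SavedModel today, it has to run client-side before calling TF Serving. A minimal sketch of building the REST predict payload from token ids; the instance field names (`input_ids`, `attention_mask`) match BERT-style inputs, but the exact names depend on your exported model’s signature, so treat them as assumptions.

```python
# Client-side workaround: tokenize first, then send token ids to TF Serving.
# TF Serving's REST predict API expects a JSON body of {"instances": [...]}.
import json


def build_tf_serving_request(input_ids, attention_mask):
    # Field names must match the exported SavedModel's serving signature;
    # input_ids / attention_mask are typical for BERT-style models.
    return {
        "instances": [
            {"input_ids": input_ids, "attention_mask": attention_mask}
        ]
    }


# In practice the ids would come from a transformers tokenizer, e.g.:
#   enc = AutoTokenizer.from_pretrained("bert-base-uncased")("some text")
#   payload = build_tf_serving_request(enc["input_ids"], enc["attention_mask"])
payload = build_tf_serving_request([101, 2023, 102], [1, 1, 1])
body = json.dumps(payload)
```

The resulting `body` would be POSTed to a TF Serving endpoint such as `http://host:8501/v1/models/<name>:predict` (URL illustrative).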


I know this is old, but have you seen this? https://huggingface.co/pricing

It’s basically exactly what you’re asking for: we host your models and run them at scale!