What is the best way to serve a Hugging Face model with an API?

Should I use TF or PyTorch?

For PyTorch, should I use TorchServe, or Pipelines behind something like Flask?


You have a few different options; here are some in increasing order of difficulty:

  1. You can use the Hugging Face Inference API via Model Hub if you are just looking for a demo.
  2. You can use a hosted model deployment platform: GCP AI predictions, SageMaker, https://modelzoo.dev/. Full disclaimer: I am the developer behind Model Zoo; happy to give you some credits for experimentation.
  3. You can roll your own model server with something like https://fastapi.tiangolo.com/ and deploy it on a generic serving platform like AWS Elastic Beanstalk or Heroku. This is the most flexible option; see the sketch below.
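For option 3, a minimal sketch of what that FastAPI server could look like (the model choice, endpoint name, and request schema here are just illustrative, not a recommendation):

```python
# Minimal FastAPI + transformers sketch (model and endpoint names are illustrative).
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the pipeline once at startup so every request reuses the same model.
classifier = pipeline("sentiment-analysis")

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    # pipeline() returns a list of dicts like [{"label": ..., "score": ...}]
    return classifier(req.text)[0]

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000
```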

@yoavz Hey, I am also looking for an answer to this. Can you share more references or tutorials? Thank you!


Sure – here are more links for each path:

  1. Hugging Face Model Hub: https://huggingface.co/transformers/model_sharing.html
  2. Model Zoo: https://docs.modelzoo.dev/quickstart/transformers.html
  3. Roll your own deployment stack: https://github.com/curiousily/Deploy-BERT-for-Sentiment-Analysis-with-FastAPI

Interested in model serving too. I don't think that FastAPI stack is quite what we want - good for a quickstart, but it's preferable to have FastAPI serve your web API and a job queue (e.g. RabbitMQ) for submitting expensive GPU jobs. I currently have an EC2 instance I spin up on demand from the FastAPI server: submit the job, receive the results, send them to the client. Alternatively, you can use AWS Batch with a Dockerfile for your transformers models. But in the spin-up case, scaling is a real pain; and in the Batch case, there's huge overhead in the Batch job coming online just for inference.
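For what it's worth, here is a rough sketch of that API + queue split using Celery with RabbitMQ as the broker (broker URL, task name, and model are assumptions; the point is just that the web process only enqueues work and the GPU worker does the heavy lifting):

```python
# Sketch of "web API + job queue": FastAPI enqueues, a Celery worker on the GPU box runs inference.
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

celery_app = Celery("inference", broker="amqp://localhost", backend="rpc://")
api = FastAPI()

_classifier = None  # cached per worker process

@celery_app.task
def run_inference(text: str):
    # Lazily load the model in the worker so only the GPU instance pays the loading cost.
    global _classifier
    if _classifier is None:
        from transformers import pipeline
        _classifier = pipeline("sentiment-analysis")
    return _classifier(text)[0]

class JobRequest(BaseModel):
    text: str

@api.post("/jobs")
def submit(req: JobRequest):
    # Enqueue the expensive GPU job and return immediately with a job id.
    task = run_inference.delay(req.text)
    return {"job_id": task.id}

@api.get("/jobs/{job_id}")
def result(job_id: str):
    # The client polls until the worker finishes.
    task = celery_app.AsyncResult(job_id)
    return {"status": task.status, "result": task.result if task.ready() else None}
```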

What we really want is proper cloud hosting of models, e.g. via GCP AI Platform. I'm not sure if Model Zoo serves this purpose; I'll check it out ASAP. I do see https://github.com/maxzzze/transformers-ai-platform/tree/master/models/classification, but its last commit is Feb 10, and my quick scan of the repo makes me think it might be a bit rigid and will take a fair bit of tinkering for flexible use cases.

What would really be handy is a tutorial on deploying transformers models to GCP AI Platform: how to prepare and upload them; how to separate the surrounding code (model prep, tokenization prep, etc.); how to deal with their 500 MB model quota; all that stuff. Ideally there'd be a fairly first-class Hugging Face exporter, or an on-site tutorial.

Actually, this could be a business proposition for Hugging Face: host your models and charge for API calls! We'd dev locally to get things sorted, then switch to the API so we don't have to worry about instance scaling and the like. Anyway, I'll check out Model Zoo in case that's what it does.


I have shared an example using TorchServe (for the NER use case), but it can be extended to other task types by using different pipelines.
blogpost and repo
Includes a demo UI too!
(can’t include more links because I’m a new user on this forum…just refer to the blogpost)
Hope it helps~
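Not the author's code, but for anyone skimming, the general shape of a TorchServe handler wrapping a transformers pipeline looks roughly like this (class name, pipeline task, and request parsing are assumptions; see the linked blog post and repo for the real version):

```python
# Rough sketch of a custom TorchServe handler around a transformers pipeline.
from ts.torch_handler.base_handler import BaseHandler
from transformers import pipeline

class TransformersHandler(BaseHandler):
    def initialize(self, context):
        # Load the pipeline once when the TorchServe worker starts.
        self.model = pipeline("ner")
        self.initialized = True

    def handle(self, data, context):
        # TorchServe passes a batch of requests; each item carries the raw body.
        texts = [row.get("data") or row.get("body") for row in data]
        texts = [t.decode("utf-8") if isinstance(t, (bytes, bytearray)) else t for t in texts]
        return [self.model(t) for t in texts]
```

The handler then gets packaged into a .mar archive with torch-model-archiver and served with torchserve; the blog post walks through those steps.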


Is there a way to serve a Hugging Face BERT model with TF Serving such that TF Serving handles the tokenization along with inference? Any related documentation or blog post?

@jplu might help with this


Hi @anubhavmaity !

Thanks for your question. Unfortunately, it is currently not possible to integrate the tokenization process along with inference directly inside a saved model. Nevertheless, it is part of our plans to make this available, and we are currently rethinking the way saved models are handled in transformers :slight_smile:
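In the meantime, the usual workaround is to run the tokenizer outside the SavedModel and call TF Serving's REST endpoint with the already-tokenized inputs. A rough sketch (model name, port, and signature input names are assumptions and depend on how you export the model):

```python
# Sketch: tokenize client-side, then call TF Serving's REST predict endpoint.
import requests
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Serving BERT with TF Serving", return_tensors="tf",
                   padding="max_length", max_length=128, truncation=True)

payload = {
    "inputs": {
        "input_ids": inputs["input_ids"].numpy().tolist(),
        "attention_mask": inputs["attention_mask"].numpy().tolist(),
        "token_type_ids": inputs["token_type_ids"].numpy().tolist(),
    }
}

# Assumes the SavedModel is served under the name "bert" on the default REST port.
resp = requests.post("http://localhost:8501/v1/models/bert:predict", json=payload)
print(resp.json())
```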


I know this is old, but have you seen this? https://huggingface.co/pricing

It's basically exactly what you're asking for: we host your models and run them at scale!
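For reference, calling a hosted model there is just an HTTP request; roughly like this (the model id and token are placeholders, and the response shape depends on the task):

```python
# Rough sketch of calling the hosted Inference API.
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer YOUR_HF_API_TOKEN"}  # placeholder token

response = requests.post(API_URL, headers=headers, json={"inputs": "I love this!"})
print(response.json())  # e.g. a list of label/score dicts for a sentiment model
```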


We are considering deploying the models in SageMaker vs. deploying on EC2.
What are others' opinions on this?

SageMaker:

  • We found that there are limited models available in SageMaker, with dependencies such as some models not being available in certain regions

  • Having a model in an S3 bucket may not go well with some regulations that require data to be kept locally

  • We found it expensive. Currently, we want to run it for a while for testing and shut the instance down when not in use, while leaving the dev environment intact. SageMaker posed some limitations here; doable, but more work.

Serving the model through a Django/REST API server:
Currently exploring this: download a model onto EC2, then run the inference client in an async loop. Thus client -> REST API -> routed to Hugging Face inference objects like Pipeline…
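To illustrate that approach, here is a rough, framework-agnostic sketch of keeping a blocking Pipeline call off the async event loop (model and function names are illustrative; a Django async view or any other server could call predict_async the same way):

```python
# Sketch: run a blocking transformers Pipeline in a worker thread from async code.
import asyncio
from concurrent.futures import ThreadPoolExecutor
from transformers import pipeline

# Load once at process start; the executor keeps the blocking call off the event loop.
classifier = pipeline("sentiment-analysis")
executor = ThreadPoolExecutor(max_workers=1)

async def predict_async(text: str):
    loop = asyncio.get_running_loop()
    # pipeline() is synchronous, so run it in the thread pool.
    return await loop.run_in_executor(executor, classifier, text)

async def main():
    result = await predict_async("EC2-hosted inference behind a REST API")
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```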

AWS Inferentia instances:
Still checking with AWS whether that's a better option. The end goal would be better latencies and cost optimization vs. EC2. However, it's not a problem for us right now, as in development/testing we will have minimal traffic.

Would be good to hear others’ thoughts and experience.