Pretrained Models to Heroku Production Environment

Hi! I’m new to Hugging Face, so excuse this if it’s a dumb question!

I’m trying to expose the T5-small pretrained model as a REST endpoint. I have both the model code and the Python Flask API working in a virtual environment on my computer. When I try to host the API on Heroku, though, the loaded model pushes the app over the dyno’s RAM limit. Are there any best practices for storing these models as static assets that an API can call? Is there a better way to do this that avoids hitting the RAM limit?

FYI: I used this tutorial for the model code - https://towardsdatascience.com/simple-abstractive-text-summarization-with-pretrained-t5-text-to-text-transfer-transformer-10f6d602c426
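
For context, the endpoint is essentially the tutorial’s model code wrapped in Flask. A simplified sketch of the setup (the route name and generation settings here are just illustrative):

```python
from flask import Flask, request, jsonify
from transformers import T5ForConditionalGeneration, T5Tokenizer

app = Flask(__name__)

# Both of these are held in RAM for the lifetime of the process,
# which is what pushes the dyno over its memory limit.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

@app.route("/summarize", methods=["POST"])
def summarize():
    text = request.json["text"]
    inputs = tokenizer.encode(
        "summarize: " + text, return_tensors="pt", max_length=512, truncation=True
    )
    summary_ids = model.generate(inputs, max_length=150)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return jsonify({"summary": summary})
```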

Not sure if it can help for your use case, but we provide an API inference endpoint for T5: https://huggingface.co/t5-small?text=My+name+is+Sarah+and+I+live+in+London. It will soon be usable for tasks other than translation.
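
Calling it from Python is just a POST request. Something like this should work (a sketch; fill in an API token from your account settings):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/t5-small"
headers = {"Authorization": "Bearer <your-api-token>"}

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "My name is Sarah and I live in London"},
)
print(response.json())
```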

Thanks for the quick response, clem! This would be perfect. Do you have abstractive summarization API endpoints available at the moment?

Tried this one - it’s returning an error:
https://huggingface.co/remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization?text=Paris+is+the+[MASK]+of+France.

You can try the bart-large-cnn checkpoint; let us know how it goes!
https://huggingface.co/facebook/bart-large-cnn
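
If you want to try the checkpoint locally first, the summarization pipeline works out of the box. A minimal sketch:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # your input text
print(summarizer(article, do_sample=False))
```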

or bart-large-xsum if you want slightly shorter summaries!
By default, bart-large-cnn summaries will be 56-142 tokens and bart-large-xsum summaries will be 10-62 tokens.
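
Both defaults can be overridden per call if you need something different, e.g. (sketch):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-xsum")

article = "..."  # your input text
# Override the checkpoint's default summary length for this call
print(summarizer(article, min_length=5, max_length=30))
```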

hi @vihardesu you can get T5 working on the Inference API for tasks other than translation using a simple hack. If you want to do summarization, replace all the task_specific_params in config.json with those of the summarization task (don’t change the task names), so it should look something like this:

"task_specific_params": {
    "summarization": {
       "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    }
  },
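
For reference, those task_specific_params map one-to-one onto generate() arguments (the pipeline also prepends the prefix for you), so you can reproduce the same behaviour locally. A sketch:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "..."  # article to summarize
# The "prefix" from the config is prepended to the input by hand here
inputs = tokenizer.encode(
    "summarize: " + text, return_tensors="pt", max_length=512, truncation=True
)

# Same values as the task_specific_params above
summary_ids = model.generate(
    inputs,
    early_stopping=True,
    length_penalty=2.0,
    max_length=200,
    min_length=30,
    no_repeat_ngram_size=3,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```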

This should work for now :grin:
