Deploying inference model size and performance

I’ve got a trained/tuned model based on Michau/t5-base-en-generate-headline. I’m looking into options for deploying this model around a simple inference API (Python/Flask). I’m very new to developing and deploying ML models etc. so bear with me!

Though I have it working, the performance is less than optimal. In a local development environment (VM + Docker) each request takes ~30" (compared to ~10" in Colab, without a GPU). In a production environment, this is ideally going to be run many thousands of times a day.

So far my trained model (pytorch_model.bin) is ~900 MB. I do inference pretty simply:

import pathlib

from transformers import AutoTokenizer, AutoModelWithLMHead

MODEL_PATH = "/src/model_files"

def infer(title: str) -> str:
    model_path = pathlib.Path(MODEL_PATH).absolute()
    title_model_tokenizer = AutoTokenizer.from_pretrained(model_path)
    title_model = AutoModelWithLMHead.from_pretrained(model_path)

    tokenized_text = title_model_tokenizer.encode(title, return_tensors="pt")
    title_ids = title_model.generate(
        tokenized_text,
        num_beams=1,
        repetition_penalty=1.0,
        length_penalty=1.0,
        early_stopping=False,
        no_repeat_ngram_size=1
    )

    return title_model_tokenizer.decode(title_ids[0], skip_special_tokens=True)

I can see from some profiling that the majority of the time is spent on:

title_model = AutoModelWithLMHead.from_pretrained(model_path)

So change #1 is to send texts in batches rather than one at a time, where possible, so that the model doesn’t have to be reloaded for every single text.
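A rough sketch of what batching could look like on the inference side, assuming the tokenizer and model are already loaded and passed in. The names `chunked`, `infer_batch` and the default `batch_size` of 8 are made up for illustration, and only a subset of the generation settings used elsewhere in this thread is shown:

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def infer_batch(titles: Sequence[str], tokenizer, model, batch_size: int = 8) -> List[str]:
    """Generate one headline per input text, feeding the model a few texts at a time.

    `tokenizer` and `model` are the already-loaded Hugging Face objects;
    nothing is loaded inside this function.
    """
    headlines: List[str] = []
    for batch in chunked(titles, batch_size):
        # Pad to the longest text in the batch so the inputs form a single tensor.
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        output_ids = model.generate(
            input_ids=encoded["input_ids"],
            attention_mask=encoded["attention_mask"],
            num_beams=1,
            no_repeat_ngram_size=1,
        )
        headlines.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return headlines
```

Whether this is actually faster on CPU than one text at a time depends on the text lengths and the batch size, so it is worth timing a few values of `batch_size` rather than assuming.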

Is there anything obvious I’m missing or yet to discover for this type of thing? I’m hoping to not need a GPU, so any ideas or improvements you can throw at me would be appreciated, thanks.

A secondary question is where would be suitable to deploy this kind of thing? Is it something that would be better outsourced to Sagemaker or similar? Or is it reasonable to host it on our own servers (specs notwithstanding)?

Hi @SMB, I think a quick win here would be to load the tokenizer and model just once when you spin up the Flask app and then call them in your infer function. For example something like this:

import pathlib

from flask import Flask
from transformers import AutoTokenizer, AutoModelWithLMHead


app = Flask(__name__)

MODEL_PATH = "/src/model_files"
model_path = pathlib.Path(MODEL_PATH).absolute()
title_model_tokenizer = AutoTokenizer.from_pretrained(model_path)
title_model = AutoModelWithLMHead.from_pretrained(model_path).to("cpu")

def infer(title: str) -> str:
    tokenized_text = title_model_tokenizer.encode(title, return_tensors="pt")
    title_ids = title_model.generate(
        tokenized_text,
        num_beams=1,
        repetition_penalty=1.0,
        length_penalty=1.0,
        early_stopping=False,
        no_repeat_ngram_size=1
    )

    return title_model_tokenizer.decode(title_ids[0], skip_special_tokens=True)

As for deployment, the answer probably depends on your production environment, but what I’ve usually done is to Dockerize the Flask app and then deploy the Docker container on Kubernetes or whatever platform is available (e.g. Heroku). There are quite a few tutorials online on how to do this, and one of my favourite resources for this kind of stuff is the Full Stack DL course: https://fullstackdeeplearning.com/

Note that Kubernetes is a complex beast and not recommended for a single-purpose app :slight_smile:

PS. If you’re not bound to using Flask, I suggest having a look at FastAPI. It makes web app development much simpler, can handle concurrency, and comes with neat built in features like data validation which is really useful for ML!

Thanks @lewtun that’s an obvious change I can make that will be a quick-win as you say. Thanks also for the deployment suggestions, I’m familiar with Heroku, but will look at the other options also, and FastAPI, cheers.

In case this still doesn’t meet your latency requirements, my next suggestion would be to quantize the model’s weights: Quantization — PyTorch 1.7.1 documentation

It basically amounts to running this line of code

model_int8 = torch.quantization.quantize_dynamic(
    model,              # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

but you should do some sanity checks to make sure the resulting accuracy hasn’t degraded too much (in which case you can try 16-bit precision).
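As a rough idea of what such a sanity check could look like, here is a small sketch on a toy model. The tiny `Sequential` network and the random inputs are stand-ins I made up for illustration; with the real model you would compare generated headlines (or a task metric) on a held-out set instead:

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the real model (assumption for illustration only).
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 16),
).eval()

# Same call as above: only the Linear layers are swapped for int8 versions.
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = torch.randn(4, 64)
with torch.no_grad():
    reference = model(inputs)
    quantized = model_int8(inputs)

# How far the quantized outputs drift from the full-precision ones.
max_abs_diff = (reference - quantized).abs().max().item()
print(f"max abs difference: {max_abs_diff:.4f}")
```

A small difference here says nothing definite about headline quality, so treat it as a smoke test before looking at real outputs.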

If you want to take it further, you can try serialising the model with ONNX (although this can be a bit finicky): Exporting transformers models — transformers 4.3.0 documentation

Revisiting this, @lewtun: your comments proved very helpful in putting me on the right path. I’ve moved to FastAPI and have some questions about concurrency.

I’m testing locally and running within a docker container using the command:

gunicorn app:app --worker-class uvicorn.workers.UvicornWorker --workers 2 --bind 0.0.0.0:8080

When I run 10 requests (each containing 50 pieces of text to run inference on) from two separate PHP workers (run by supervisor) that call http://0.0.0.0:8080/predict/batch, the response times are:

app_1  | 0:00:37.766941
app_1  | 0:00:38.065883
app_1  | 0:00:35.876927
app_1  | 0:00:38.612610
app_1  | 0:00:38.268158
app_1  | 0:00:37.676483
app_1  | 0:00:36.451956
app_1  | 0:00:39.346421
app_1  | 0:00:32.831906
app_1  | 0:00:33.972420

Total time for the queue of 10 requests to clear: ~3’

When I run 10 requests from 10 PHP workers I get response times of:

app_1  | 0:02:57.640630
app_1  | 0:03:09.019433
app_1  | 0:03:13.338791
app_1  | 0:03:16.411248
app_1  | 0:03:16.476156
app_1  | 0:03:20.084982
app_1  | 0:03:21.052187
app_1  | 0:03:23.064294
app_1  | 0:03:23.681471
app_1  | 0:03:23.975776

So you can see the first request returned at ~2’57", whilst #10 returned after ~3’23".

So, really no different. Or, rather, when using 10 workers it appears they are not really running concurrently?

My code is essentially now:

import pathlib
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelWithLMHead
from typing import List
from datetime import datetime

title_model = None
title_model_tokenizer = None

app = FastAPI()

class TitleIn(BaseModel):
    id: int
    text: str

class TitleOut(BaseModel):
    id: int
    text: str

def infer(title: str) -> str:
    tokenized_text = title_model_tokenizer.encode(title, return_tensors="pt")
    title_ids = title_model.generate(
        tokenized_text,
        num_beams=1,
        repetition_penalty=1.0,
        length_penalty=1.0,
        early_stopping=False,
        no_repeat_ngram_size=1
    )

    return title_model_tokenizer.decode(title_ids[0], skip_special_tokens=True)

@app.on_event("startup")
async def startup_event():
    global title_model, title_model_tokenizer

    model_path = "/src/model_files"
    title_model = AutoModelWithLMHead.from_pretrained(pathlib.Path(model_path).absolute()).to("cpu")
    title_model_tokenizer = AutoTokenizer.from_pretrained(pathlib.Path(model_path).absolute(), use_fast=False)

@app.post('/predict/batch', response_model=List[TitleOut])
def predict_batch(title_list: List[TitleIn]):
    batch_predictions = []

    startTime = datetime.now()

    for title in title_list:
        batch_predictions.append(TitleOut(id=title.id, text=infer(title.text)))

    print(datetime.now() - startTime)
    return batch_predictions

I feel like I’m missing something? Any further tips? Thank you.

Hi @SMB, one thing that might be causing trouble is that PyTorch will attempt to use all cores by default to process each request. You could try setting the number of threads to 1 in your startup_event function with

torch.set_num_threads(1)

See here for more details. I’m not sure this will solve your concurrency problem, but it’s a good place to start :slight_smile:
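For what it’s worth, a back-of-the-envelope calculation from the timings you posted suggests the overall throughput is about the same in both runs, which would be consistent with the CPU already being fully used in each case. The sketch below assumes the two PHP workers split the 10 requests evenly and send them back to back; that is my reading of your setup, not something the logs confirm:

```python
# Per-request times in seconds from the first run (2 PHP workers), copied from the log.
run_two_clients = [
    37.766941, 38.065883, 35.876927, 38.612610, 38.268158,
    37.676483, 36.451956, 39.346421, 32.831906, 33.972420,
]
# Completion time of the slowest request in the second run: 0:03:23.975776.
run_ten_clients_last = 3 * 60 + 23.975776

TEXTS_PER_REQUEST = 50
total_texts = TEXTS_PER_REQUEST * len(run_two_clients)

# Assumption: each of the 2 PHP workers sends 5 requests one after another,
# so the wall-clock time is roughly half of the summed request times.
wall_two_clients = sum(run_two_clients) / 2

throughput_two = total_texts / wall_two_clients
throughput_ten = total_texts / run_ten_clients_last

print(f"2 clients:  ~{wall_two_clients:.0f} s total, {throughput_two:.2f} texts/s")
print(f"10 clients: ~{run_ten_clients_last:.0f} s total, {throughput_ten:.2f} texts/s")
```

That works out to roughly 2.7 texts per second with 2 clients versus roughly 2.5 with 10, so sending more requests at once mostly makes each one wait longer rather than getting more work done.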