I’ve got a trained/tuned model based on Michau/t5-base-en-generate-headline. I’m looking into options for deploying this model behind a simple inference API (Python/Flask). I’m very new to developing and deploying ML models, so bear with me!
Though I have it working, the performance is less than optimal. In a local development environment (VM + Docker) each request takes ~30 seconds (compared to ~10 seconds in Colab, without a GPU). In a production environment, this is going to be run many thousands of times daily…ideally.
So far my trained model (pytorch_model.bin) is ~900 MB. I do inference pretty simply:
import pathlib

from transformers import AutoModelWithLMHead, AutoTokenizer

MODEL_PATH = "/src/model_files"

def infer(title: str) -> str:
    model_path = pathlib.Path(MODEL_PATH).absolute()
    # Tokenizer and model are loaded on every call
    title_model_tokenizer = AutoTokenizer.from_pretrained(model_path)
    title_model = AutoModelWithLMHead.from_pretrained(model_path)
    tokenized_text = title_model_tokenizer.encode(title, return_tensors="pt")
    title_ids = title_model.generate(
        tokenized_text,
        num_beams=1,
        repetition_penalty=1.0,
        length_penalty=1.0,
        early_stopping=False,
        no_repeat_ngram_size=1,
    )
    return title_model_tokenizer.decode(title_ids[0], skip_special_tokens=True)
I can see from some profiling that the majority of the time is spent on:
title_model = AutoModelWithLMHead.from_pretrained(model_path)
So change #1 is to send titles in batches rather than one at a time where possible, and to load the model once at startup so it isn’t reloaded on every request (sketched below).
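Roughly, something like this is what I have in mind — the tokenizer and model are loaded once at module import, and a batched function takes a list of titles (the infer_batch name and padding/truncation settings are just illustrative, using the same generate parameters as above):

import pathlib

from transformers import AutoModelWithLMHead, AutoTokenizer

MODEL_PATH = "/src/model_files"

# Load once at startup, not per request
_model_path = pathlib.Path(MODEL_PATH).absolute()
_tokenizer = AutoTokenizer.from_pretrained(_model_path)
_model = AutoModelWithLMHead.from_pretrained(_model_path)
_model.eval()

def infer_batch(titles: list) -> list:
    # Tokenize the whole batch at once, padding to the longest title
    inputs = _tokenizer(titles, return_tensors="pt", padding=True, truncation=True)
    title_ids = _model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=1,
        repetition_penalty=1.0,
        length_penalty=1.0,
        early_stopping=False,
        no_repeat_ngram_size=1,
    )
    return [_tokenizer.decode(ids, skip_special_tokens=True) for ids in title_ids]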
Is there anything obvious I’m missing or yet to discover for this type of thing? I’m hoping to avoid needing a GPU, so any ideas or improvements you can throw at me would be appreciated, thanks.
A secondary question is where it would be suitable to deploy this kind of thing. Is it something that would be better outsourced to SageMaker or similar? Or is it reasonable to host it on our own servers (specs notwithstanding)?