Memory issues with model deployment

trackmania · August 18, 2020, 2:57pm

With the help of the awesome transformers library I have trained a multilabel classifier that predicts topics for comments. I’m using TFDistilBertModel as a layer, wrapped in the Keras functional API. The architecture of the model is as follows: 2 input layers, then a DistilBERT layer instantiated like this:
distilbert_layer = TFDistilBertModel.from_pretrained('/path/to/distilbert_model')
followed by some LSTM and Dense layers. The model is trained and everything works fine up to this point. Now comes prediction time.

We serve our models in AWS containers managed by Kubernetes. Each time a user wants to generate a topic for a new comment, we send a prediction task. To make a prediction I instantiate the model architecture, including re-creating distilbert_layer as described above, then I load the model weights, which I saved after training with the Keras save_weights() function. Now the model is ready to predict.

The issue:
Each time a prediction task is sent, it consumes around 600 MB of memory, which is not released after the prediction. I assume some kind of load balancing is at work, so each task can be sent to a different process in the container, and each process uses its own RAM, which is why caching the model does not help. RAM is quickly exhausted, then the worker freezes and either gets rebooted or triggers the start of another worker.
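For illustration, the "build once per process, then reuse" idea can be sketched roughly as below. This is only a sketch under assumptions: `LazyModel`, `build_model` and `handle_prediction_task` are hypothetical names, and `build_model` is a stand-in for the real steps (re-creating the Keras architecture around TFDistilBertModel and loading the saved weights). Each worker process would still hold its own copy of the model, so this can only stop the per-task growth inside one process; it does not share memory between processes.

```python
import threading
from typing import Any, Callable, Optional


class LazyModel:
    """Build a model at most once per process and reuse it for every request."""

    def __init__(self, build_fn: Callable[[], Any]) -> None:
        self._build_fn = build_fn
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # Double-checked locking: only the first caller pays the build cost,
        # even if several threads ask for the model at the same time.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._build_fn()
        return self._model


def build_model() -> Any:
    # Hypothetical placeholder. In the real service this is where the
    # architecture would be rebuilt and the saved weights loaded.
    return object()


# Module-level holder: created once when the worker process imports this file.
MODEL = LazyModel(build_model)


def handle_prediction_task(comment: str) -> Any:
    model = MODEL.get()  # reused after the first call in this process
    # A real handler would tokenize `comment` and run the model here.
    return model
```

With this layout the expensive build happens on the first task a process receives, and later tasks in the same process reuse the same object instead of allocating a new model.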

Has anyone experienced such issues? Any help is very much appreciated.

AsaKal · November 18, 2020, 1:09pm

I am currently facing the same issue. I deployed a question answering model on a DigitalOcean droplet. After I sent it a few texts, the server stopped responding and went down.
My take is that ONNX Runtime would help; have you tried it? I am converting the model to ONNX and redeploying it to see how it performs.
Please let me know if you have found a solution.
Thanks.

rgwatwormhill · November 18, 2020, 2:49pm

Have you tried asking the question on a Kubernetes forum? It doesn’t look like a transformers problem to me. (However, the Hugging Face people are smart cookies, so they might know anyway, I suppose.)

Are you hoping to get the system to release the 600mb after use, or do you want to keep the model loaded and persuade the system to use the same loaded model each time?

Have you considered using a smaller model? You could maybe prune the distilBERT model you currently have.

Narsil · November 20, 2020, 1:32pm

Hi, yes, NLP models tend to use quite a bit of RAM, and keeping them loaded in RAM the whole time is the only way to get fast inference.
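As a rough back-of-the-envelope check of the RAM point, the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes about 66 million parameters for a base DistilBERT model (an approximate figure, not taken from this thread) and counts only the weights; the framework runtime, activations, and any extra LSTM and Dense layers come on top of that.

```python
def weights_size_mib(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in MiB."""
    return num_params * bytes_per_param / 2**20


# Assumption: approximate parameter count of a base DistilBERT model.
DISTILBERT_PARAMS = 66_000_000

fp32_mib = weights_size_mib(DISTILBERT_PARAMS, 4)  # 4 bytes per float32
fp16_mib = weights_size_mib(DISTILBERT_PARAMS, 2)  # 2 bytes per float16

print(f"fp32 weights: {fp32_mib:.0f} MiB, fp16 weights: {fp16_mib:.0f} MiB")
```

Under these assumptions, halving the precision halves the weight storage, but it does not by itself change how many copies of the model are loaded.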

You could use our hosted API inference (https://huggingface.co/pricing). This is the exact problem we aim to solve, and we can host your own finetuned model.

If you want to stick to your own hosted solution then:
1- Attempt to stick everything in RAM (get 1 pod per model to prevent loading/unloading issues)
2- You can try to distil your model, use fp16, or use ONNX Runtime. That requires quite a bit of work and has various caveats.
3- Use a GPU even for inference to batch multiple inferences at once
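The batching idea in point 3 could look roughly like the sketch below. It is framework-agnostic on purpose: `predict_batch` is a hypothetical callable standing in for whatever runs the loaded model on a list of texts, and the batch size of 32 is an arbitrary assumption to be tuned.

```python
from typing import Any, Callable, Iterator, List, Sequence


def chunks(items: Sequence[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[List[str]], List[Any]],
    batch_size: int = 32,
) -> List[Any]:
    """Run `predict_batch` once per batch instead of once per text."""
    results: List[Any] = []
    for batch in chunks(texts, batch_size):
        results.extend(predict_batch(batch))
    return results
```

The results come back in the same order as the input texts, so callers can pair each comment with its prediction by position.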

If you can/want to use our hosted API, you can upload your model to the hub (https://huggingface.co/transformers/model_sharing.html) and then directly use the API
curl -X POST -d 'My sentence' https://api-inference.huggingface.co/models/mymodel
We solve 1. and 3. for you automatically. And we are working hard to get 2. working (but it’s tricky).
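For callers working from Python rather than the shell, the curl call above might be mirrored with the standard library as in the sketch below. `mymodel` is the same placeholder as in the curl example, `build_inference_request` is a hypothetical helper name, and anything about authentication or the response format is left out because it is not covered here.

```python
import urllib.request

API_BASE = "https://api-inference.huggingface.co/models/"


def build_inference_request(model_id: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying the raw text as the body."""
    return urllib.request.Request(
        API_BASE + model_id,
        data=text.encode("utf-8"),
        method="POST",
    )


request = build_inference_request("mymodel", "My sentence")

# To actually send it (needs network access and a real model id):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```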

RamonMamon · July 16, 2021, 1:22am

Hi Narsil, do the hosted models have cold-start times for when the API is called for the first time after a while?

Narsil · July 16, 2021, 6:34am

Hi @RamonMamon ,

Yes, but subscribed users can pin a model to force it to always be up!

Cheers,
