With the help of the awesome transformers library I have trained a multilabel classifier, which predicts topics for comments. I'm using TFDistilBertModel as a layer, wrapped in the Keras functional API. The architecture of the model is as follows: 2 input layers, then a layer instantiated like this:
distilbert_layer = TFDistilBertModel.from_pretrained('/path/to/distilbert_model'), then some LSTM and Dense layers. The model is trained, and everything works perfectly at this point. Now comes prediction time.
We serve our models in AWS containers, managed by Kubernetes. Each time a user wants to generate topics for a new comment, we send a prediction task. To make a prediction, I instantiate the model architecture, including recreating distilbert_layer as described above, then I load the model weights, which I saved after training with the Keras save_weights() function. Now the model is ready to predict.
Each time a prediction task is sent, it consumes around 600 MB of memory, which is not released after prediction. I assume a load-balancer principle is at work, so each task can be sent to a different process on the container, each using its own RAM, which is why caching the model does not help. RAM quickly gets exhausted; then the worker just freezes and gets rebooted, or another worker has to be started.
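For what it's worth, the caching I tried amounts to loading the model at most once per process, e.g. with a module-level cache. A stdlib-only sketch of the pattern, with a stub loader standing in for the real Keras rebuild:

```python
import functools

@functools.lru_cache(maxsize=1)
def get_model(weights_path="/path/to/weights.h5"):
    """Load the model once per process; subsequent calls reuse the cached object."""
    # In production this would rebuild the Keras architecture and call
    # model.load_weights(weights_path); a lightweight stub is used here.
    return {"weights": weights_path, "loaded": True}

def predict(text):
    model = get_model()  # no reload after the first call in this process
    return len(text) % 2  # stand-in for model.predict(...)

# The cache only helps if requests land on the same process; with many
# worker processes, each one still pays the ~600 MB once.
assert get_model() is get_model()
```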
Has anyone experienced such issues? Any help is very much appreciated.
I am currently facing the same issue. I deployed a question-answering model on a DigitalOcean droplet. After sending it a few texts, the server stopped running and went down.
My take is that ONNX Runtime would help; have you tried it? I am converting the model to ONNX and redeploying it to see how it performs.
Please let me know if you have found a solution.
Have you tried asking the question on a Kubernetes forum? It doesn't look like a transformers problem to me. (However, the Hugging Face people are smart cookies, so they might know anyway, I suppose.)
Are you hoping to get the system to release the 600 MB after use, or do you want to keep the model loaded and persuade the system to use the same loaded model each time?
Have you considered using a smaller model? You could maybe prune the DistilBERT model you currently have.
Hi, yes, NLP models tend to use quite a bit of RAM, and keeping them in RAM throughout is the only way to get fast inference.
You could use our hosted API inference (https://huggingface.co/pricing). This is the exact problem we aim to solve, and we can host your own finetuned model.
If you want to stick to your own hosted solution, then:
1- Attempt to stick everything in RAM (get 1 pod per model to prevent loading/unloading issues)
2- You can try to distil your model, or use fp16 or onnx_runtime. That requires quite a bit of work and has various caveats.
3- Use a GPU, even for inference, to batch multiple inferences at once.
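The batching in point 3 can be sketched independently of the GPU: collect pending requests and run them through the model in one call instead of one at a time. A stdlib-only illustration, with a stub standing in for the real model:

```python
def batch_predict(texts, batch_size=32, model=None):
    """Group inputs into batches so one model call serves many requests."""
    model = model or (lambda batch: [len(t) for t in batch])  # stub for model.predict
    results = []
    for i in range(0, len(texts), batch_size):
        results.extend(model(texts[i:i + batch_size]))  # one call per batch
    return results

# Three requests, but only two model calls with batch_size=2
print(batch_predict(["a", "bb", "ccc"], batch_size=2))  # -> [1, 2, 3]
```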
If you can/want to use our hosted API, you can upload your model to the hub (https://huggingface.co/transformers/model_sharing.html) and then use the API directly:
curl -X POST -d 'My sentence' https://api-inference.huggingface.co/models/mymodel
We solve 1. and 3. for you automatically. And we are working hard to get 2. working (but it’s tricky).
Hi Narsil, do the hosted models have cold-start times when the API is called for the first time after a while?
Hi @RamonMamon ,
Yes, but subscribed users can pin a model to force it to always be up!