I am creating a REST API (using Flask) that runs inference with several models given a list, for example summarization, sequence-to-sequence classification, etc.
The problem is that all the models don't fit on the GPU at the same time.
Is there a way to load a model onto the GPU, run inference with it, move it back to the CPU, then load the next model onto the GPU for inference, move it back to the CPU, and so on?
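
Something like this minimal sketch is what I have in mind (assuming PyTorch + transformers; the model names are just examples, and the `infer` helper is hypothetical):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load every model on the CPU up front; only one lives on the GPU at a time.
models = {
    "summarization": AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6"),
    "classification": AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
}
tokenizers = {
    "summarization": AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6"),
    "classification": AutoTokenizer.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
}

def infer(task: str, text: str):
    """Move one model to the GPU, run inference, then move it back to the CPU."""
    model = models[task].to(device)                                  # weights -> GPU
    inputs = tokenizers[task](text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        if task == "summarization":
            output = model.generate(**inputs)
            result = tokenizers[task].decode(output[0], skip_special_tokens=True)
        else:
            result = model(**inputs).logits.argmax(dim=-1).item()
    models[task] = model.to("cpu")                                   # weights -> CPU, freeing GPU memory
    torch.cuda.empty_cache()                                         # release cached blocks back to the driver
    return result
```

Is this back-and-forth approach reasonable for a Flask endpoint, or is there a better/standard way to handle several models that don't fit on the GPU together?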