REST API with several models using the GPU, but not at the same time

I am creating a REST API (using Flask) that runs inference with several models given a list, for example summarization, sequence-to-sequence classification, etc.

The problem is that all the models don't fit on the GPU at the same time.

Is there a way to load a model onto the GPU, run inference with that model, move it back to the CPU, then load the next model onto the GPU for inference, and so on?
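For what it's worth, here is a minimal sketch of the swapping idea I have in mind, using plain PyTorch. The helper `infer_with_swapping` and the `models` dict are just illustrative names, and it falls back to CPU when no GPU is available:

```python
import torch

def infer_with_swapping(models, task, *args):
    """Move one model at a time onto the GPU, run it, then evict it.

    `models` maps a task name to a (model, infer_fn) pair; all models
    are assumed to start on the CPU.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, infer_fn = models[task]
    model.to(device)                       # load this model onto the GPU
    try:
        with torch.no_grad():
            return infer_fn(model, *args)  # run inference while it is resident
    finally:
        model.to("cpu")                    # evict it so the next model fits
        if device == "cuda":
            torch.cuda.empty_cache()       # release cached blocks back to the driver
```

The `finally` block is meant to guarantee the model is moved back to the CPU even if inference raises, so a failed request doesn't leave the GPU full.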

The summarization task works on the GPU if I run the script directly on the virtual machine, without going through Flask. However, once I start it under Flask I get:

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`