I am creating a REST API (using Flask) that runs inference with several models given a list, for example summarization, sequence-to-sequence classification, etc.
The problem is that all the models don't fit on the GPU at the same time.
Is there a way to load a model onto the GPU, run inference with it, move it back to the CPU, then load the next model onto the GPU for inference, move it back to the CPU, and so on?
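
Something like this minimal sketch is what I have in mind (assuming PyTorch + transformers; the model names are just examples, and the `infer` helper is hypothetical):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load every model on the CPU up front; only one lives on the GPU at a time.
models = {
    "summarization": AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6"),
    "classification": AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
}
tokenizers = {
    "summarization": AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6"),
    "classification": AutoTokenizer.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
}

def infer(task: str, text: str):
    """Move one model to the GPU, run inference, then move it back to the CPU."""
    model = models[task].to(device)                                  # weights -> GPU
    inputs = tokenizers[task](text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        if task == "summarization":
            output = model.generate(**inputs)
            result = tokenizers[task].decode(output[0], skip_special_tokens=True)
        else:
            result = model(**inputs).logits.argmax(dim=-1).item()
    models[task] = model.to("cpu")                                   # weights -> CPU, freeing GPU memory
    torch.cuda.empty_cache()                                         # release cached blocks back to the driver
    return result
```

Is this back-and-forth approach reasonable for a Flask endpoint, or is there a better/standard way to handle several models that don't fit on the GPU together?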