Optimizing LLM Inference with One Base LLM and Multiple LoRA Adapters for Memory Efficiency

Hello everyone,

I’m currently using Large Language Models (LLMs) for a complex project. My pipeline consists of a series of steps, and at each step I invoke an LLM with a specific prompt. These prompts are intricate, and to make them more effective I have fine-tuned the LLM several times on datasets derived from them. As a result, I have several models, each about 20 GB in size (10.8B parameters), which I call sequentially. Because of their size, I can only fit two models at a time into 40 GB of GPU memory.

I train with LoRA (Low-Rank Adaptation), so each fine-tuned model is really the shared base model plus a corresponding adapter. The adapter itself is small (roughly 0.5 GB). Ideally, if inference could be performed with a single base model and multiple adapters, I could keep 30–40 model variations loaded at once, each fine-tuned for a specific task.

My questions are:

  1. What Python code would be needed to organize LLM inference so that only one base model plus several different LoRA adapters are held in GPU memory? This would minimize memory usage while allowing many task-specific model variations to be loaded (see the sketch after this list).

  2. Are there existing server solutions, similar to vLLM or Ollama, that serve a single base LLM with multiple adapters, each fine-tuned for a specific task, while staying memory-efficient? (A vLLM sketch is included further down.)
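
For question 1, here is a minimal sketch of what this could look like with Hugging Face `transformers` and `peft`: one copy of the base model stays on the GPU, and adapters are attached and activated by name. The base model id and adapter paths below are placeholders for your own checkpoints.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "your-org/your-10.8b-base-model"  # placeholder for the actual base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,  # ~20 GB loaded once in fp16
    device_map="auto",
)

# Wrap the base model with the first adapter and give it a name.
model = PeftModel.from_pretrained(base_model, "adapters/task_a", adapter_name="task_a")

# Attach further adapters to the same base model (~0.5 GB each).
model.load_adapter("adapters/task_b", adapter_name="task_b")
model.load_adapter("adapters/task_c", adapter_name="task_c")

def run(prompt: str, adapter: str) -> str:
    # Activate the adapter needed for this pipeline step, then generate.
    model.set_adapter(adapter)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(run("Summarize the following report ...", "task_a"))
print(run("Extract the key entities ...", "task_b"))
```

Only one adapter is active per call, which matches a sequential pipeline, and each additional adapter adds only its own weights on top of the single base model.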

Any insights or suggestions on these matters would be greatly appreciated!

I think it’s been done here: predibase/lorax (https://github.com/predibase/lorax), a multi-LoRA inference server that scales to 1000s of fine-tuned LLMs.
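
For question 2, besides LoRAX, vLLM itself supports serving one base model with many LoRA adapters. A rough sketch of its offline API, assuming a recent vLLM version (model id, adapter names, and paths are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="your-org/your-10.8b-base-model",  # placeholder base checkpoint
    enable_lora=True,     # keep one base model, apply LoRA adapters per request
    max_loras=8,          # how many adapters may be active concurrently
    max_lora_rank=16,     # must cover the rank used during fine-tuning
)

params = SamplingParams(temperature=0.0, max_tokens=256)

# Each request names its adapter; vLLM keeps a single copy of the base
# weights and loads/swaps the small LoRA weights as needed.
out_a = llm.generate(
    ["Summarize the following report ..."],
    params,
    lora_request=LoRARequest("task_a", 1, "adapters/task_a"),
)
out_b = llm.generate(
    ["Extract the key entities ..."],
    params,
    lora_request=LoRARequest("task_b", 2, "adapters/task_b"),
)
print(out_a[0].outputs[0].text)
print(out_b[0].outputs[0].text)
```

The same feature is exposed by vLLM's OpenAI-compatible server via `--enable-lora` and `--lora-modules name=path`, so clients can pick an adapter per request while only one copy of the base weights sits in GPU memory.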
