I’m currently using Large Language Models (LLMs) for a complex project. My pipeline consists of a series of steps, each of which invokes an LLM with a specific prompt. These prompts are intricate, and to improve their effectiveness I’ve fine-tuned the LLM several times on different datasets derived from them. As a result, I have several models, each about 20GB in size (10.8B parameters), which I call sequentially. Because of their size, only two models fit at a time in 40GB of GPU memory.
Importantly, I train with LoRA (Low-Rank Adaptation), so each fine-tuned model is really the base model plus a corresponding adapter, and the adapter itself is small (approximately 0.5GB). Ideally, if inference could be run with a single base model and multiple adapters, I could keep 30–40 different model variations loaded simultaneously, each fine-tuned for a specific task.
My questions are:
What Python code would be needed to organize LLM inference so that only one base model and several different LoRA adapters reside in GPU memory? This should optimize memory usage and allow loading numerous task-specific model variations.
Are there any existing serving solutions, akin to vLLM or Ollama, that support a single base LLM with multiple adapters, each fine-tuned for a specific task, while remaining memory-efficient?
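For context, I’ve seen that vLLM advertises multi-LoRA serving via an `--enable-lora` flag. If I understand it correctly, the launch would look roughly like the following (the base model name and adapter paths are placeholders), with each request then selecting an adapter by name through the `model` field of the OpenAI-compatible API:

```shell
# One base model in GPU memory; adapters are small and registered by name.
# Model name and adapter paths below are placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules task-a=/adapters/task_a task-b=/adapters/task_b
```

Has anyone used this in production with dozens of adapters, or is there a better-suited server for this workload?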
Any insights or suggestions on these matters would be greatly appreciated!