I’m currently using Large Language Models (LLMs) for a complex project. My pipeline consists of a series of steps, each of which invokes an LLM with a specific prompt. These prompts are intricate, and to improve their effectiveness I’ve fine-tuned the LLM several times on different datasets derived from them. As a result, I have several models, each about 20GB in size (10.8B parameters), which I call sequentially. Because of their size, only two models fit at a time in 40GB of GPU memory.
Importantly, I train with LoRA (Low-Rank Adaptation), so each fine-tuned model is really the base model plus a corresponding adapter, and the adapter itself is small (approximately 0.5GB). Ideally, if inference could be run with a single base model and multiple adapters, I could keep 30–40 different model variations loaded simultaneously, each fine-tuned for a specific task.
My questions are:
What Python code would be needed to organize LLM inference so that only one base model and several different LoRA adapters reside in GPU memory? This should optimize memory usage and allow loading numerous task-specific model variations.
Are there any existing serving solutions, akin to vLLM or Ollama, that support a single base LLM with multiple adapters, each fine-tuned for a specific task, while remaining memory-efficient?
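For context, I’ve seen that vLLM advertises multi-LoRA serving via an `--enable-lora` flag. If I understand it correctly, the launch would look roughly like the following (the base model name and adapter paths are placeholders), with each request then selecting an adapter by name through the `model` field of the OpenAI-compatible API:

```shell
# One base model in GPU memory; adapters are small and registered by name.
# Model name and adapter paths below are placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules task-a=/adapters/task_a task-b=/adapters/task_b
```

Has anyone used this in production with dozens of adapters, or is there a better-suited server for this workload?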
Any insights or suggestions on these matters would be greatly appreciated!