How to run two LLMs in series for inference?

aalinaga · January 10, 2024, 7:18am

Hi there,

I am building an application that accepts a user query and feeds it to a BART model for zero-shot classification. The output of the BART model is then fed to a second LLM (LLAMA-2).

I am looking for an efficient way to load and query both LLMs while keeping GPU memory usage to a minimum.
Here is what I currently do:
1- Load the BART model using the transformers ZeroShotClassificationPipeline
2- Load the LLAMA-2-13B model using the transformers TextGenerationPipeline
3- For each user query received:
(a) Send the query to the model loaded in step 1
(b) Send the result from (a) to the model loaded in step 2
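
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned setup: the checkpoint names (`facebook/bart-large-mnli`, `meta-llama/Llama-2-13b-chat-hf`), the half-precision setting, the prompt template, and the helper names `load_pipelines`, `build_prompt`, and `answer_query` are all assumptions that do not come from the post. Both pipelines are loaded once and reused for every query.

```python
from typing import Any, Callable, Sequence


def load_pipelines():
    """Steps 1 and 2: load both models once, at application startup."""
    # Imported here so the chaining logic below can be exercised without a GPU.
    import torch
    from transformers import pipeline

    classifier = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli",  # assumed checkpoint
        device=0,
    )
    generator = pipeline(
        "text-generation",
        model="meta-llama/Llama-2-13b-chat-hf",  # assumed checkpoint (gated on the Hub)
        torch_dtype=torch.float16,  # assumption: half precision to reduce GPU memory
        device_map="auto",
    )
    return classifier, generator


def build_prompt(query: str, label: str) -> str:
    """Hypothetical prompt template that passes the predicted label to the second model."""
    return (
        f"The user's request was classified as '{label}'.\n"
        f"Request: {query}\n"
        "Response:"
    )


def answer_query(
    query: str,
    labels: Sequence[str],
    classifier: Callable[..., Any],
    generator: Callable[..., Any],
    max_new_tokens: int = 128,
) -> str:
    """Step 3: classify the query (a), then feed the result to the generator (b)."""
    # (a) The zero-shot pipeline returns labels sorted by descending score.
    result = classifier(query, candidate_labels=list(labels))
    top_label = result["labels"][0]

    # (b) Generate from a prompt built out of the query and the top label.
    prompt = build_prompt(query, top_label)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the newly generated text
    )
    return outputs[0]["generated_text"]
```

In the application, `load_pipelines()` would be called once at startup and the two returned objects passed to `answer_query` for each incoming query, so neither model is reloaded per request.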
