I am building an application that accepts a user query and feeds it to a BART model for zero-shot classification. The output of the BART model is then fed to a second model, LLaMA-2.
I am looking for an efficient way to load and query both models while keeping GPU memory usage to a minimum.
Here is what I do in steps:
1- Load the BART model using the transformers ZeroShotClassificationPipeline
2- Load the LLaMA-2-13B model using the transformers TextGenerationPipeline
3- For each incoming user query:
(a) Send the query to the model loaded in step 1
(b) Send the result from (a) to the model loaded in step 2
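The steps above can be sketched roughly as follows. Note that the model checkpoints, the candidate labels, and the prompt format are all illustrative assumptions, not details from my setup; loading both pipelines once up front and reusing them per query is the part that matters for memory:

```python
# Sketch of the two-step setup described above. Model names, candidate
# labels, and the prompt format are assumed for illustration.

def build_pipelines():
    """Steps 1 and 2: load both pipelines once so every query reuses them.

    device_map="auto" lets accelerate place layers on available devices;
    torch_dtype=float16 roughly halves GPU memory versus float32.
    """
    import torch
    from transformers import pipeline

    classifier = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli",  # assumed BART NLI checkpoint
        device_map="auto",
        torch_dtype=torch.float16,
    )
    generator = pipeline(
        "text-generation",
        model="meta-llama/Llama-2-13b-chat-hf",  # assumed 13B checkpoint
        device_map="auto",
        torch_dtype=torch.float16,
    )
    return classifier, generator


def route_query(query, classifier, generator,
                labels=("question", "command", "chitchat")):
    """Step 3: classify the query (a), then feed the result to LLaMA-2 (b)."""
    result = classifier(query, candidate_labels=list(labels))
    top_label = result["labels"][0]  # labels come back sorted by score
    prompt = f"[intent: {top_label}] {query}"  # hypothetical prompt format
    return generator(prompt, max_new_tokens=128)
```

Usage would then be `classifier, generator = build_pipelines()` once at startup, followed by `route_query(q, classifier, generator)` inside the request loop.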