Find LLM to run on single gpu with only 8 GB ram

I have a single nvidia gpu with 8 GB of ram. I’m running it on an ubuntu server 18.04 LTS. I’m able to pass queries and get response from flan-T5, but when I tried performing peft with lora I got a “gpu out of memory” error. Similarly I tried running camel-5b and llama2-7b-chat as chat agents, and both threw a “gpu out of memory error.” I’m trying to experiment with LLM, learn the structure of the code, prompt engineering. Ultimately I’d like to develop a chat agent with llama2-70b-chat even if I have to run it on colab. can anyone suggest a similar structure LLM to llama2-7b-chat that might be able to run on my single gpu with 8 gb ram?

Do you find any good models, I am facing the same hardware limitations as you.
So please tell me some good ones if you find any.

Hi @scotsditch and @knt21,

If you want to fine-tune the model then use phi-2 Small language model. It’s quite impressive. If you want to use LLM then use quantized versions of actual LLM which takes very less space. I like this model TheBloke/openchat-3.5-0106-GPTQ · Hugging Face which is just 4.16 GB in total. Check the leadership board to see which model is the best and then choose the quantized version of it.

1 Like

Hello thank you, I’m looking for a model to generate response in RAG system and I’m kind of more concerned about the model takes up in ram rather than model size, also can you just guide me how to select a model from huggingface, I mean even if I narrow down a bit on the basis of license, language, task, etc there are sometimes still thousands or hundreds of choices and also how do I select on MTEB leaderboard (same issue lots of choices) ,
I’ll appreciate if you gudie me on that.

Thanks again.

I built RAG and used same model which I suggested. It was working smooth with some prompt engineering. You are talking about the generative model and referring to Embedding space MTEB. MTEB leaderboard is for embedding models.

How did I choose the model:

Add filter on Open LLM leadership board: Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4

If you want to see the model’s performance on your data then you can use this Arena (quick test): LMSys Chatbot Arena Leaderboard - a Hugging Face Space by lmsys


Oh thank you very much, I’ll try your advice.
Again thanks for taking out your valuable time.
I think also Text or Text2Text will work better? Because I tried to use Text2Text with index.as_query_engine method and it returned me error (I forgot what it was I guess llm has no attribute Metadata).
Also when I download Textgen models with Transformers library and it doesn’t fit with the index.as_query_engine but when I download it with HuggingFaceLlm it works.

Hmm, I think you are referring llamaindex library. If so, I have not used it in my implementation rather I used the Gen model from huggingface itself.

You can refer to LLM-demo/Zephyr_LLM.ipynb at main · mit1280/LLM-demo · GitHub. It’s not gen implementation but if you change the messages then it should work for your use case. I am thinking of creating an end-to-end RAG tutorial meanwhile you can get started with the above code.

1 Like

Oh alright, thank you and also

  1. how to identify quantized version of these huge models from huggingface which i can download via transformers library?
    I mean hint or something in namne (for example, for instruct models we have instruct or it written in name).

  2. yes many times when using these models I get error for accelerate library but I have downloaded it, even i can import it on my notebook but the huggingface will not detect it, I even tried restarting kernel, re-installation, etc.

  3. and after we have downloaded model via transformers library, do we have any function or some other way by which we can get the prompt template, other than the model card?

I will try to share demo code in this weekend hopefully.

here you go @knt21 RAG_TUTORIAL/Tutorial_RAG.ipynb at main · mit1280/RAG_TUTORIAL · GitHub.

It’s very basic implementation so you need to update prompt and the way data is divide into chunks and other stuff too. This will give you basic idea about how you can use it.

Thanks for this
I’ll look into it meanwhile I am also trying to download quantized model by myself.