Find an LLM to run on a single GPU with only 8 GB of VRAM

I have a single NVIDIA GPU with 8 GB of VRAM, running on an Ubuntu Server 18.04 LTS machine. I can send queries to Flan-T5 and get responses, but when I tried PEFT fine-tuning with LoRA I got a GPU out-of-memory error. Similarly, I tried running camel-5b and llama2-7b-chat as chat agents, and both threw a GPU out-of-memory error. I’m trying to experiment with LLMs, learn the structure of the code, and practice prompt engineering. Ultimately I’d like to develop a chat agent with llama2-70b-chat, even if I have to run it on Colab. Can anyone suggest an LLM with a similar structure to llama2-7b-chat that might be able to run on my single GPU with 8 GB of VRAM?

Hello,
Did you find any good models? I am facing the same hardware limitations as you, so please share some good ones if you find any.
Thanks.

Hi @scotsditch and @knt21,

If you want to fine-tune a model, use the phi-2 small language model; it’s quite impressive. If you want to use an LLM for inference, use a quantized version of a full-size LLM, which takes much less space. I like this model, TheBloke/openchat-3.5-0106-GPTQ · Hugging Face, which is just 4.16 GB in total. Check the leaderboard to see which model performs best and then choose its quantized version.
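A minimal sketch of loading such a GPTQ model with Transformers (assuming the auto-gptq/optimum packages are installed alongside transformers; the prompt and generation parameters are only illustrative):

```python
# Minimal sketch: load a GPTQ-quantized model on a single 8 GB GPU.
# Assumes auto-gptq (or optimum) is installed alongside transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/openchat-3.5-0106-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the quantized weights on the GPU
)

prompt = "Explain retrieval-augmented generation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```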


Hello, thank you. I’m looking for a model to generate responses in a RAG system, and I’m more concerned about how much memory the model takes up at runtime than about its size on disk. Also, can you guide me on how to select a model from Hugging Face? Even after narrowing down by license, language, task, etc., there are sometimes still hundreds or thousands of choices, and I have the same problem selecting from the MTEB leaderboard.
I’d appreciate it if you could guide me on that.

Thanks again.

I built a RAG system and used the same model I suggested; it worked smoothly with some prompt engineering. Note that you are asking about the generative model but referring to MTEB, which is the leaderboard for embedding models.

How I chose the model:

Add filters on the Open LLM Leaderboard: Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4

If you want to see a model’s performance on your kind of data, you can do a quick test in the Arena: LMSys Chatbot Arena Leaderboard - a Hugging Face Space by lmsys


Oh, thank you very much, I’ll try your advice. Thanks again for taking the time.
Also, which works better, a text-generation or a Text2Text model? I tried using a Text2Text model with the index.as_query_engine method and it returned an error (I forget exactly what it was, something like the LLM having no attribute metadata).
Also, when I download text-generation models with the Transformers library they don’t work with index.as_query_engine, but when I load them with HuggingFaceLLM they do.

Hmm, I think you are referring to the LlamaIndex library. If so, I have not used it in my implementation; rather, I used the generative model from Hugging Face directly.

You can refer to LLM-demo/Zephyr_LLM.ipynb at main · mit1280/LLM-demo · GitHub. It’s not a full RAG implementation, but if you change the messages it should work for your use case. I am thinking of creating an end-to-end RAG tutorial; meanwhile, you can get started with the above code.
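The core pattern is the standard Zephyr chat usage (this is a sketch along the lines of the model card, not the exact notebook code; the messages and generation parameters are illustrative):

```python
# Sketch of the standard Zephyr chat pattern (illustrative, not the notebook's exact code).
# Note: bf16 weights need roughly 14 GB, so on an 8 GB GPU you would want a quantized variant instead.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant that answers using the provided context."},
    {"role": "user", "content": "Context: ...retrieved chunks go here...\n\nQuestion: What does the document say about X?"},
]

# Build the prompt from the model's own chat template, then generate.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])
```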


Oh, alright, thank you. Also:

  1. How do I identify quantized versions of these huge models on Hugging Face that I can download via the Transformers library? Is there a hint in the name or something (for example, instruct models have “instruct” in the name)?

  2. Many times when using these models I get an error about the accelerate library, but I have installed it and can even import it in my notebook; Hugging Face just won’t detect it. I have tried restarting the kernel, re-installing, etc.

  3. After we have downloaded a model via the Transformers library, is there a function or some other way to get the prompt template, other than the model card?

I will try to share demo code this weekend, hopefully.
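On your third question: if the model ships a chat template, the tokenizer exposes it, so you can build the prompt programmatically instead of copying it from the model card. A minimal sketch (assuming the tokenizer actually has a chat template set; the model name is just an example):

```python
# Sketch: read the prompt/chat template straight from the tokenizer instead of the model card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# The raw Jinja template, if the model ships one.
print(tokenizer.chat_template)

# Or let the tokenizer format a conversation for you.
messages = [{"role": "user", "content": "Hello, who are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```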

Here you go, @knt21: RAG_TUTORIAL/Tutorial_RAG.ipynb at main · mit1280/RAG_TUTORIAL · GitHub.

It’s a very basic implementation, so you will need to update the prompt, the way the data is divided into chunks, and other things too, but it will give you a basic idea of how to use it.
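In outline (this is an illustrative sketch of the general flow, not the notebook’s exact code; the model names, chunking, and prompt are placeholders you would tune for your data):

```python
# Illustrative sketch of a minimal RAG loop: embed chunks, retrieve, generate.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# 1. Split your documents into chunks (here: naive fixed-size chunks).
document = open("my_document.txt").read()
chunks = [document[i:i + 500] for i in range(0, len(document), 500)]

# 2. Embed the chunks with a small embedding model.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

# 3. Retrieve the chunks closest to the question.
question = "What does the document say about X?"
question_embedding = embedder.encode(question, convert_to_tensor=True)
scores = util.cos_sim(question_embedding, chunk_embeddings)[0]
top_chunks = [chunks[i] for i in scores.topk(3).indices]

# 4. Feed the question plus retrieved context to the generator
#    (same quantized model as suggested above; requires auto-gptq).
generator = pipeline("text-generation", model="TheBloke/openchat-3.5-0106-GPTQ", device_map="auto")
context = "\n\n".join(top_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)
print(generator(prompt, max_new_tokens=200)[0]["generated_text"])
```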

Thanks for this.
I’ll look into it; meanwhile, I am also trying to download a quantized model myself.