Running 70B models on a retail GPU

EDIT: I’m just out of the gate here, so I’m learning the search terms/terminology as I go.
TL;DR: Want to run a huge 70b+ model on a small GPU?
https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8
https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/

Also, this article seems to imply I was barking up the right tree: device_map can be used to offload parts of the model, and it references a way to infer the correct device_map.
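For my own reference, here’s roughly what I think that inference step looks like, using accelerate’s init_empty_weights and infer_auto_device_map (the max_memory budgets below are just guesses for a 6 GiB card, not something I’ve validated):

    import torch
    from accelerate import init_empty_weights, infer_auto_device_map
    from transformers import AutoConfig, AutoModelForCausalLM

    # Build the model skeleton on the "meta" device (no weights loaded, no RAM used)
    config = AutoConfig.from_pretrained("upstage/Llama-2-70b-instruct-v2")
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)

    # Let accelerate propose a placement given per-device memory budgets.
    # Anything that doesn't fit in the budgets gets assigned to "disk".
    device_map = infer_auto_device_map(
        empty_model,
        max_memory={0: "5GiB", "cpu": "24GiB"},
        no_split_module_classes=["LlamaDecoderLayer"],
        dtype=torch.float16,
    )
    print(device_map)

The idea, as I understand it, is that the skeleton model costs no memory, and whatever doesn’t fit in the budgets gets pushed down to CPU and then disk.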

EDIT: In my question below I’m assuming that mapping a layer to disk basically tells the loader to “swap” data in and out of GPU memory during inference. Given the size of these files, I’m realizing that’s probably not practical. But then, what’s the point of supporting ‘disk’ locations in the device_map? I guess I need to try a 4-bit or 8-bit quantized version of this model (or a smaller one), given this GPU seems to have only 6144 MiB of VRAM.
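Even at 4 bits, ~70B parameters is roughly 35 GB of weights, so on a 6 GiB card I’d presumably have to point this at a much smaller model, but I think the call would look something like this (a sketch based on the BitsAndBytesConfig 4-bit options in transformers, untested on my machine):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit NF4 quantization; compute happens in fp16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "upstage/Llama-2-70b-instruct-v2",  # or a smaller model that actually fits
        quantization_config=bnb_config,
        device_map="auto",  # let accelerate place whatever fits on the GPU
    )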

I am a complete noob, but I think I’m pretty close to running the upstage/Llama-2-70b-instruct-v2 model on a retail GeForce 30-series GPU. I have a few questions:

  1. Am I crazy, or should this be possible, with or without quantization?

Below, I show the code I’m using to get to the point where I think the model is close to runnable. Note that the device_map I’m using puts most layers on disk.

  2. Is there any way to tell the model loader not to re-create the offload files on every run, but instead reuse the offload files generated during the previous run? It takes several minutes to write those files, and my guess is they don’t change across runs.

  3. Where can I look to find all the “layers” the model expects? I had to re-run this over and over to discover what I needed to list in the device_map. (I sketch one way to dump these names after this list.)

  4. Am I even close to the right path here, or should I be using a completely different set of tools and/or much smaller models?
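Regarding question 3, the best idea I have so far is to build the model skeleton with the same init_empty_weights trick as in the sketch above and print the parameter names it expects, something like:

    from accelerate import init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("upstage/Llama-2-70b-instruct-v2")
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)

    # Prints every parameter name the model expects, e.g.
    # "model.layers.0.self_attn.q_proj.weight", "lm_head.weight", ...
    for name, param in empty_model.named_parameters():
        print(name, tuple(param.shape))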

If there’s already a cookbook I can follow somewhere, I’m all ears. Thanks in advance!

A.

EDIT: OK, I guess I’m not very close to making this work. At some point after the 3rd print, I get:

...
  File ".../lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 317, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File ".../lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 184, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I’ll try to figure that out…

=============

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig

# 8-bit quantization via bitsandbytes; llm_int8_enable_fp32_cpu_offload keeps the
# modules mapped to "cpu"/"disk" in fp32, since int8 modules can't run off-GPU.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Pin the small pieces (embeddings, final norm) to GPU 0 and the output head to CPU.
# (The transformer.* keys come from the BLOOM example in the HF docs and don't match
# Llama's module names, so they most likely have no effect here.)
device_map = {
    "model.norm.weight": 0,
    "model.embed_tokens.weight": 0,
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

# Per-layer parameter names I discovered by trial and error (see question 3 above).
layers = [
    "model.layers.{}.post_attention_layernorm.weight",
    "model.layers.{}.input_layernorm.weight",
    "model.layers.{}.mlp.down_proj.weight",
    "model.layers.{}.mlp.up_proj.weight",
    "model.layers.{}.mlp.gate_proj.weight",
    "model.layers.{}.self_attn.rotary_emb.inv_freq",
    "model.layers.{}.self_attn.o_proj.weight",
    "model.layers.{}.self_attn.v_proj.weight",
    "model.layers.{}.self_attn.k_proj.weight",
    "model.layers.{}.self_attn.q_proj.weight",
]

# Llama-2-70B has 80 decoder layers: keep layer 0 on the GPU and push the rest to disk.
for i in range(80):
    for l in layers:
        device_map[l.format(i)] = 0 if i == 0 else "disk"

print("!!!!!!!!!!!!!!!!!!!!!!!! 1")
tokenizer = AutoTokenizer.from_pretrained("upstage/Llama-2-70b-instruct-v2")
print("!!!!!!!!!!!!!!!!!!!!!!!! 2")
model = AutoModelForCausalLM.from_pretrained(
    "upstage/Llama-2-70b-instruct-v2",
    device_map=device_map,
    # device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,        # 8-bit settings defined above
    rope_scaling={"type": "dynamic", "factor": 2},   # allows handling of longer inputs
    offload_folder="/home/user/src/llm/offload",     # where the "disk" layers get written
)

print("!!!!!!!!!!!!!!!!!!!!!!!! 3")

prompt = "### User:\nThomas is healthy, but he has to go to the hospital. What could be the reasons?\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
#del inputs["token_type_ids"]
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=float('inf'))
output_text = tokenizer.decode(output[0], skip_special_tokens=True)