Running 70B models on a retail GPU

EDIT: I’m just out of the gate here, so I’m learning the search terms/terminology as I go.
TL;DR: Want to run a huge 70b+ model on a small GPU?
https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8
https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/

Also, this article seems to imply I was barking up the right tree: device_map can be used to offload parts of the model, and it references a way to infer the correct device_map.
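For my own reference, here’s roughly what I think that inference step looks like, using accelerate’s init_empty_weights and infer_auto_device_map (the max_memory budgets below are just guesses for a 6 GiB card, not something I’ve validated):

    import torch
    from accelerate import init_empty_weights, infer_auto_device_map
    from transformers import AutoConfig, AutoModelForCausalLM

    # Build the model skeleton on the "meta" device (no weights loaded, no RAM used)
    config = AutoConfig.from_pretrained("upstage/Llama-2-70b-instruct-v2")
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)

    # Let accelerate propose a placement given per-device memory budgets.
    # Anything that doesn't fit in the budgets gets assigned to "disk".
    device_map = infer_auto_device_map(
        empty_model,
        max_memory={0: "5GiB", "cpu": "24GiB"},
        no_split_module_classes=["LlamaDecoderLayer"],
        dtype=torch.float16,
    )
    print(device_map)

The idea, as I understand it, is that the skeleton model costs no memory, and whatever doesn’t fit in the budgets gets pushed down to CPU and then disk.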

EDIT: In my question below I’m assuming that mapping a layer to disk basically tells the loader to “swap” data in and out of GPU memory during inference. Given the size of these files, I’m realizing that’s probably not practical. But then, what’s the point of supporting ‘disk’ locations in the device_map? I guess I need to try a 4-bit or 8-bit quantized version of this model (or a smaller one), given this GPU seems to have only 6144 MiB of VRAM.
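Even at 4 bits, ~70B parameters is roughly 35 GB of weights, so on a 6 GiB card I’d presumably have to point this at a much smaller model, but I think the call would look something like this (a sketch based on the BitsAndBytesConfig 4-bit options in transformers, untested on my machine):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit NF4 quantization; compute happens in fp16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "upstage/Llama-2-70b-instruct-v2",  # or a smaller model that actually fits
        quantization_config=bnb_config,
        device_map="auto",  # let accelerate place whatever fits on the GPU
    )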

I am a complete noob, but I think I’m pretty close to running the upstage/Llama-2-70b-instruct-v2 model on a retail GeForce 30-series GPU. I have a few questions:

  1. Am I crazy, or should this be possible, with or without quantization?

Below, I show the code I’m using to get to the point where I think the model is close to runnable. Note that the device_map I’m using puts most layers on disk.

  2. Is there any way to tell the model loader not to re-create the offload files on every run, but instead reuse the offload files generated during the previous run? It takes several minutes to write those files, and my guess is they don’t change across runs.

  3. Where can I look to find all the “layers” the model expects? I had to re-run this over and over to discover what I needed to list in the device_map. (I sketch one way to dump these names after this list.)

  4. Am I even close to the right path here, or should I be using a completely different set of tools and/or much smaller models?
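Regarding question 3, the best idea I have so far is to build the model skeleton with the same init_empty_weights trick as in the sketch above and print the parameter names it expects, something like:

    from accelerate import init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("upstage/Llama-2-70b-instruct-v2")
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)

    # Prints every parameter name the model expects, e.g.
    # "model.layers.0.self_attn.q_proj.weight", "lm_head.weight", ...
    for name, param in empty_model.named_parameters():
        print(name, tuple(param.shape))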

If there’s already a cookbook I can follow somewhere, I’m all ears. Thanks in advance!

A.

EDIT: OK, I guess I’m not very close to making this work. At some point after the 3rd print, I get:

...
  File ".../lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 317, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File ".../lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 184, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I’ll try to figure that out…

=============

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig

# 8-bit quantization via bitsandbytes; llm_int8_enable_fp32_cpu_offload keeps the
# modules mapped to "cpu"/"disk" in fp32, since int8 modules can't run off-GPU.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Pin the small pieces (embeddings, final norm) to GPU 0 and the output head to CPU.
# (The transformer.* keys come from the BLOOM example in the HF docs and don't match
# Llama's module names, so they most likely have no effect here.)
device_map = {
    "model.norm.weight": 0,
    "model.embed_tokens.weight": 0,
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

# Per-layer parameter names I discovered by trial and error (see question 3 above).
layers = [
    "model.layers.{}.post_attention_layernorm.weight",
    "model.layers.{}.input_layernorm.weight",
    "model.layers.{}.mlp.down_proj.weight",
    "model.layers.{}.mlp.up_proj.weight",
    "model.layers.{}.mlp.gate_proj.weight",
    "model.layers.{}.self_attn.rotary_emb.inv_freq",
    "model.layers.{}.self_attn.o_proj.weight",
    "model.layers.{}.self_attn.v_proj.weight",
    "model.layers.{}.self_attn.k_proj.weight",
    "model.layers.{}.self_attn.q_proj.weight",
]

# Llama-2-70B has 80 decoder layers: keep layer 0 on the GPU and push the rest to disk.
for i in range(80):
    for l in layers:
        device_map[l.format(i)] = 0 if i == 0 else "disk"

print("!!!!!!!!!!!!!!!!!!!!!!!! 1")
tokenizer = AutoTokenizer.from_pretrained("upstage/Llama-2-70b-instruct-v2")
print("!!!!!!!!!!!!!!!!!!!!!!!! 2")
model = AutoModelForCausalLM.from_pretrained(
    "upstage/Llama-2-70b-instruct-v2",
    device_map=device_map,
    # device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,        # 8-bit settings defined above
    rope_scaling={"type": "dynamic", "factor": 2},   # allows handling of longer inputs
    offload_folder="/home/user/src/llm/offload",     # where the "disk" layers get written
)

print("!!!!!!!!!!!!!!!!!!!!!!!! 3")

prompt = "### User:\nThomas is healthy, but he has to go to the hospital. What could be the reasons?\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
#del inputs["token_type_ids"]
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=float('inf'))
output_text = tokenizer.decode(output[0], skip_special_tokens=True)