If I use Llama 70B and 7B for speculative decoding, how should I place them on my multiple GPUs in the code?

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
prompt = "i see a big"
checkpoint = Llama-3.2-70B-Instruct"
assistant_checkpoint = "Llama-3.2-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, device_map="auto")
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

It may cause OOM.
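What I have in mind is something like the following sketch, assuming 4 GPUs: shard the 70B target model over GPUs 0-2 with max_memory and keep the 7B draft model alone on GPU 3. The checkpoint names, GPU count, and per-GPU memory caps are placeholders, not verified values.

# Sketch: shard the 70B target over GPUs 0-2, put the 7B draft on GPU 3,
# so the two models do not compete for the same memory.
# GPU count and per-GPU memory caps below are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = "Llama-3.2-70B-Instruct"            # target model
assistant_checkpoint = "Llama-3.2-7B-Instruct"   # draft model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Let accelerate spread the 70B weights over GPUs 0-2 only,
# leaving GPU 3 free for the draft model.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB", 2: "75GiB", 3: "0GiB"},
)

# Place the entire 7B draft model on GPU 3.
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_checkpoint,
    torch_dtype=torch.float16,
    device_map={"": 3},
)

prompt = "i see a big"
# Send the inputs to the device holding the first layers of the target model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

In that case I would also leave CUDA_VISIBLE_DEVICES unset (or list all GPUs I want to use), since pinning it to a single device as in my snippet above forces both models onto one GPU. Is this the right approach, or is there a better way to split the two models?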
