Code runs inference with the "Llama 3 70b instruct" model on CPU but has problems with inference on GPUs

vbachi · April 28, 2024, 10:53pm

The code below is a modification of the code from https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct for running inference with the "Llama 3 70b instruct" model. I made only two changes to the code from that link:

1. The model is loaded from the hard drive.
2. Changed device="auto" to device_map="auto". With device="auto" the model did not load, but it does load with device_map="auto".

import transformers
import torch
 
from pathlib import Path
 
 
# Replace with the path to your local folder containing the model files
model_path = Path("/home/myuser/llama_3/Llama-3-70B-Instruct-weights/")
 
pipeline = transformers.pipeline(
    "text-generation",
    model=model_path,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
 
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
 
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
 
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
 
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])

This code produces good output on a machine where I do not have any GPU. If I launch it on a machine that has 2 GPUs (NVIDIA A100 80GB each), I get output made of words (the words are real and come from different languages), but those words do not form any meaningful text.
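
To help narrow this down, here is a minimal sketch for checking where the layers ended up after loading. It assumes the model was loaded with device_map="auto", in which case the model object carries an hf_device_map dictionary mapping module names to devices. The helper name summarize_device_map and the sample map are hypothetical and only for illustration.

```python
from collections import Counter


def summarize_device_map(device_map: dict) -> dict:
    """Count how many modules were placed on each device.

    `device_map` maps module names to devices (GPU indices such as 0 or 1,
    or strings such as "cpu" or "disk"). Keys of the result are the device
    names converted to strings.
    """
    return dict(Counter(str(device) for device in device_map.values()))


# Hypothetical sample map, shaped like the one stored on a loaded model.
sample_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": "cpu",
}
summary = summarize_device_map(sample_map)
print(summary)
```

With the pipeline from the code above, the call would be summarize_device_map(pipeline.model.hf_device_map). If every module is reported on GPU 0 and GPU 1 and the text is still garbled, the problem is probably not CPU offloading.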

My question: how can I modify the code above so that it uses both GPUs first and, if it needs more memory, uses the CPU for the remainder?
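
One direction that may be worth trying is passing max_memory alongside device_map="auto", so that the loader fills the GPUs up to a cap and places whatever does not fit on the CPU. In bfloat16 a 70B-parameter model needs roughly 140 GB for the weights alone, so two 80 GB cards leave little headroom. The sketch below is untested on this setup; the helper names build_max_memory and load_pipeline and the 70 GiB / 200 GiB caps are assumptions, not recommended values.

```python
def build_max_memory(num_gpus: int, gpu_gib: int, cpu_gib: int) -> dict:
    """Build per-device memory caps in the format used by `max_memory`.

    GPU entries are keyed by integer index, the CPU entry by the string "cpu".
    """
    limits = {index: f"{gpu_gib}GiB" for index in range(num_gpus)}
    limits["cpu"] = f"{cpu_gib}GiB"
    return limits


def load_pipeline(model_path: str, max_memory: dict):
    """Load the text-generation pipeline with capped per-device memory."""
    # Imported lazily so build_max_memory can be used without torch installed.
    import torch
    import transformers

    return transformers.pipeline(
        "text-generation",
        model=model_path,
        # model_kwargs are forwarded to from_pretrained, which accepts max_memory.
        model_kwargs={"torch_dtype": torch.bfloat16, "max_memory": max_memory},
        device_map="auto",
    )


# Assumed caps: leave some room on each 80 GB GPU for activations and KV cache.
max_memory = build_max_memory(num_gpus=2, gpu_gib=70, cpu_gib=200)
print(max_memory)
```

The pipeline would then be created with load_pipeline("/home/myuser/llama_3/Llama-3-70B-Instruct-weights/", max_memory) and used exactly as in the code above. Layers that land on the CPU will run much more slowly than those on the GPUs.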
