Hi, I have a single machine with 10 H100 GPUs (0-9), 80 GB of GPU memory each. When I load the model onto 2 GPUs it works well, but when I switch to 3 GPUs (~45 GB per GPU) or more (tested 3-9), the model loads, yet at inference time it either gives trash output like "…////" or throws an error saying the probability contains nan or inf values.

I have tried device_map="auto", the empty-weights loading plus model dispatch with the Llama decoder layer specified to stay on one GPU, and custom device maps as well (a sketch of that dispatch variant is at the end of this post). I also tried many different models, and they all had the same issue. With ollama I was able to load the model and run inference on all 10 GPUs, so I don't think the problem is with the GPUs themselves.

I also experimented with different generation arguments and found one thing: if you set do_sample to False you get the probability error, otherwise you get the "…////" output. With a small model you instead get some random Russian, Spanish, etc. words.

I have also tried different dtype configurations: float16, bfloat16, and float32 (no results, I waited a long time).

I am sharing my code below; can you point me in the right direction? Thanks a lot.
import os

# set the cache location before importing transformers so it is actually picked up
os.environ["TRANSFORMERS_CACHE"] = "/data/HF_models"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "/data/HF_models/hub/models--meta-llama--Meta-Llama-3.1-70B/snapshots/7740ff69081bd553f4879f71eebcc2d6df2fbcb3"

# shard the model across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(model)  # inspect the loaded model structure

message = "Tell me a joke"
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 20,
    # "return_full_text": False,
    # "temperature": 0.4,
    # "do_sample": True,  # False worked
    # "top_p": 0.5,
}
print(pipe(message, **generation_args))
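
For reference, the accelerate empty-weights / dispatch variant I mentioned looked roughly like this (a sketch from memory; the per-GPU memory limits are placeholders, not my exact values, and checkpoint is the same snapshot path as above):

from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM
import torch

config = AutoConfig.from_pretrained(checkpoint)

# build the model structure on the meta device, without allocating real weights
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)
empty_model.tie_weights()

# custom device map: keep each decoder layer on a single GPU and cap per-GPU memory
device_map = infer_auto_device_map(
    empty_model,
    max_memory={i: "45GiB" for i in range(3)},  # placeholder limits for the 3-GPU run
    no_split_module_classes=["LlamaDecoderLayer"],
)

# load the sharded checkpoint and dispatch the weights according to the map
model = load_checkpoint_and_dispatch(
    empty_model,
    checkpoint,
    device_map=device_map,
    no_split_module_classes=["LlamaDecoderLayer"],
    dtype=torch.bfloat16,
)

This loads fine too, but inference gives the same "…////" output as the device_map="auto" version above.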