Problem with inference on "meta-llama/Meta-Llama-3.1-70B"

Hi, I have a single machine with 10 H100 GPUs (0-9), each with 80 GB of GPU RAM. When I load the model onto 2 GPUs it works well, but when I switch to 3 GPUs (about 45 GB per GPU) or more (tested with 3-9), the model loads, yet at inference time it either gives trash output like "…////" or raises an error saying the probability tensor contains nan or inf values.

I have tried device_map="auto", loading with empty weights and then dispatching the model with the Llama decoder layer kept on a single GPU, and custom device maps. I also tried many other models, and all of them had the same issue. With Ollama I was able to load the model and run inference across all 10 GPUs, so I don't think the problem is the GPUs themselves.

I have also experimented with different generation arguments and found one thing: if do_sample is set to False you get the probability error, otherwise you get the "…////" output. With a smaller model you get random Russian, Spanish, etc. words instead. I have also tried different dtypes (float16, bfloat16, and float32, which produced no results even after waiting a long time).

I am sharing my code below (with a rough sketch of the dispatch variant after it); can you point me in the right direction? Thanks a lot.

import os

# Set the cache location before importing transformers so it is picked up.
os.environ["TRANSFORMERS_CACHE"] = "/data/HF_models"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "/data/HF_models/hub/models--meta-llama--Meta-Llama-3.1-70B/snapshots/7740ff69081bd553f4879f71eebcc2d6df2fbcb3"

# Shard the model across all visible GPUs in bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

print(model)

message = "Tell me a joke"

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 20,
    # "return_full_text": False,
    # "temperature": 0.4,
    # "do_sample": True,  # with False you get the probability error instead
    # "top_p": 0.5,
}

print(pipe(message, **generation_args))
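
For reference, the empty-weights loading and dispatch variant I mention above looked roughly like this. This is a minimal sketch rather than my exact script; it assumes accelerate's init_empty_weights and load_checkpoint_and_dispatch, and the max_memory values in the comment at the end are placeholders.

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "/data/HF_models/hub/models--meta-llama--Meta-Llama-3.1-70B/snapshots/7740ff69081bd553f4879f71eebcc2d6df2fbcb3"

config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    # Build the model skeleton without allocating real weights.
    model = AutoModelForCausalLM.from_config(config)

# Listing LlamaDecoderLayer in no_split_module_classes keeps each decoder
# layer on a single GPU instead of splitting it across devices.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint,  # folder containing the checkpoint shards
    device_map="auto",
    no_split_module_classes=["LlamaDecoderLayer"],
    dtype=torch.bfloat16,
)

# The custom device maps I tried were built along the same lines, optionally
# capping per-GPU memory with a max_memory dict (placeholder values), e.g.:
# max_memory={0: "45GiB", 1: "45GiB", 2: "45GiB"}

Both this variant and the pipeline script above end up with the same garbage output or probability error once more than 2 GPUs are used.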

You’re using some awesome hardware…
I’m far from a multi-GPU guy, so I can’t help you solve this directly, but maybe it has something to do with this? It may not be environment-dependent; it could be a problem with the library.

Thank you, John, for the reply. I will share my problem there.


Actually, I’ve replied there too, so I noticed when I got the notification. :sunglasses:
Since you shared the issue over there, it should reach plenty of the HF staff; victor commented just above you. If it’s a bug, it will be fixed.