Multi-GPU inference with LLM produces gibberish

Hey,

I am currently trying to run inference on “huggyllama/llama-7b”. I am using the following minimal script:

import torch
from transformers import pipeline

checkpoint = "huggyllama/llama-7b"
p = pipeline("text-generation", model=checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
print(p("hi there"))

I run it with torchrun:

NCCL_P2P_DISABLE='1' torchrun --nproc_per_node <n_gpus> --master_port=13833 run_llama_hf.py

With a single GPU, I get reasonable outputs.
Output single GPU: ‘hi there, I’m a newbie to this forum and I’m looking for some help’

As soon as I use multiple GPUs, I get:
Output multi GPU: ‘hi there header driv EUannotuta voor measurements shooting variableslowea grayŌbestįbinding’

My setup:
2 x NVIDIA A30 (24 GB VRAM)

  • transformers version: 4.28.0.dev0
  • accelerate version: 0.18.0.dev0
  • Platform: Linux-4.18.0-305.3.1.el8_4.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.16
  • Huggingface_hub version: 0.13.2
  • PyTorch version (GPU?): 2.0.0+cu117 (True)

Also: as soon as I run the 13B model, I run out of VRAM (which shouldn’t be the case if the model were split across both GPUs).

Any ideas about what might be the problem are appreciated :hugs:.

This code cannot be run with torchrun. device_map="auto" will already use your two GPUs for the generation.
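For example (just a sketch, reusing the script from the original post), run it as a plain single process, e.g. python run_llama_hf.py, and let the pipeline place the shards:

import torch
from transformers import pipeline

# run with: python run_llama_hf.py  (single process, no torchrun / accelerate launch)
checkpoint = "huggyllama/llama-7b"
p = pipeline("text-generation", model=checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

# device_map="auto" shards the weights across all visible GPUs;
# you can inspect the resulting placement here:
print(p.model.hf_device_map)
print(p("hi there"))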

Thanks a lot for your reply.
I have also tried plain Python and accelerate launch before, but I got the same gibberish output.

EDIT: I don’t know if it is related, but I had similar issues with native LLaMA on multi-machine runs before (see Torchrun distributed running does not work · Issue #201 · facebookresearch/llama · GitHub), which were due to wrong assignment of LOCAL_RANK and (global) RANK in the original repo.

There seems to be a deeper problem (it appears to involve an interaction between the hardware, the drivers, and the latest version of transformers/tokenizers). We got in contact with NVIDIA about this.
Since it is only indirectly related to transformers, this can be closed.

Hi Team,
Any updates on this issue? I am still seeing similar gibberish output when using multiple GPUs. Any idea why this occurs?

Thanks,
Ramesh.

Running into the same issue, help would be appreciated!

@Alchemy5 and @rameshveer

What type of GPUs do you use? (Out of curiosity, since I suspect this is an interplay between transformers/tokenizers and the GPU hardware being used.)

Just wondering, has anyone found a solution?

Running into the same issue, help would be appreciated!

4 x V100: everything is good.
8 x V100: output consists of nonsense characters: \n\n…\t\tt

For me, it was an NCCL issue in the end. We had to deactivate ACS on the HPC I was working on, and the problem was resolved (see: Troubleshooting — NCCL 2.19.3 documentation).
ACS interfered with the communication between the GPUs.
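If someone wants a quick sanity check of GPU-to-GPU communication before digging into BIOS settings, here is a rough sketch (it only verifies that data copied between two GPUs arrives intact):

import torch

# reports whether GPU 0 can directly access memory on GPU 1 (P2P)
print(torch.cuda.can_device_access_peer(0, 1))

# copy a tensor from GPU 0 to GPU 1 and back, then check it survived the round trip
x = torch.randn(1 << 20, device="cuda:0")
y = x.to("cuda:1")
print(torch.equal(x, y.to("cuda:0")))  # should print True; anything else points at broken GPU-to-GPU communication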

@Dragon777 : Is the general setup somehow different in the two cases? If the eight GPUs are spread across different nodes of your HPC and the four GPUs in the first case are not, I could imagine that something is going wrong with the inter-node communication. I think the NCCL performance tests are a good tool for diagnosing the problem: GitHub - NVIDIA/nccl-tests: NCCL Tests
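
If you want something quicker than the full nccl-tests suite, a tiny all_reduce check already tells you whether NCCL collectives return correct values (a sketch; save it as e.g. nccl_check.py and launch with torchrun --nproc_per_node <n_gpus> nccl_check.py):

import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# every rank contributes a tensor filled with its own rank id
t = torch.full((1024,), float(dist.get_rank()), device="cuda")
dist.all_reduce(t)  # sums the tensors across all ranks

expected = sum(range(dist.get_world_size()))
print(f"rank {dist.get_rank()}: ok={bool((t == expected).all())}")

dist.destroy_process_group()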

Can anyone teach me how to use 2 GPUs to run inference? Accelerate can’t detect my GPUs.

When I was running inference with falcon-7b and mistral-7b-v0.1, I was getting gibberish until I adjusted my generation_config as below:

generation_config.repetition_penalty = 1.2
generation_config.no_repeat_ngram_size = 2
generation_config.early_stopping = True

Later on we switched to the instruct version of Mistral (mistral-7b-instruct-v0.2) and these settings had to be removed, but perhaps playing with these options will help!
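
For context, here is a rough sketch of how those settings can be applied (I am assuming the Hub id mistralai/Mistral-7B-v0.1 for the base model mentioned above; adapt it to your checkpoint):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

name = "mistralai/Mistral-7B-v0.1"  # assumed Hub id for the base model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")

generation_config = GenerationConfig.from_pretrained(name)
generation_config.repetition_penalty = 1.2
generation_config.no_repeat_ngram_size = 2
generation_config.early_stopping = True  # only has an effect with beam search

inputs = tok("hi there", return_tensors="pt").to(model.device)
out = model.generate(**inputs, generation_config=generation_config, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))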

Good luck!

Thank you so much, this sounds so complicated.