Multi-GPU inference with LLM produces gibberish


I am currently trying to run inference on “huggyllama/llama-7b”. I am using the following minimal script:

import torch
from transformers import pipeline

checkpoint = "huggyllama/llama-7b"
# device_map="auto" lets accelerate place the model on the available GPUs
p = pipeline("text-generation", checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
print(p("hi there"))

I run it with torchrun:

NCCL_P2P_DISABLE='1' torchrun --nproc_per_node <n_gpus> --master_port=13833

With a single GPU, I get reasonable outputs.
Output single GPU: ‘hi there, I’m a newbie to this forum and I’m looking for some help’

As soon as I use multiple GPUs, I get:
Output multi GPU: ‘hi there header driv EUannotuta voor measurements shooting variableslowea grayŌbestįbinding’

My setup:
2 x NVIDIA A30 (24 GB VRAM each)

  • transformers version: 4.28.0.dev0
  • accelerate version: 0.18.0.dev0
  • Platform: Linux-4.18.0-305.3.1.el8_4.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.16
  • Huggingface_hub version: 0.13.2
  • PyTorch version (GPU?): 2.0.0+cu117 (True)

Also: As soon as I run the 13B model, I run out of vRAM (which shouldn’t be the case, if the model was loaded in parallel).

Any ideas about what might be the problem appreciated :hugs:.
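
A rough back-of-the-envelope check on the memory question, as a sketch only. It assumes nominal parameter counts of 7e9 and 13e9, 2 bytes per parameter for bfloat16, and that every process started by torchrun loads its own full copy of the weights. Activations and the KV cache are ignored, and the helper names are made up for illustration.

```python
# Rough weight-memory estimate; parameter counts are nominal, not exact.
BYTES_BF16 = 2        # bfloat16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 24    # one A30, as listed in the setup above
N_GPUS = 2


def weights_gb(n_params: int, bytes_per_param: int = BYTES_BF16) -> float:
    """Memory needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def fits(n_params: int, n_copies: int) -> bool:
    """True if n_copies of the weights fit into the combined GPU memory."""
    return n_copies * weights_gb(n_params) <= GPU_MEMORY_GB * N_GPUS


for name, n_params in [("7B", 7_000_000_000), ("13B", 13_000_000_000)]:
    for n_copies in (1, 2):
        total = n_copies * weights_gb(n_params)
        verdict = "fits" if fits(n_params, n_copies) else "does not fit"
        print(f"{name}, {n_copies} copy/copies: {total} GB, {verdict}")
```

Under these assumptions a single sharded copy of the 13B weights (26 GB) fits into the combined 48 GB, while two copies (52 GB) do not, which would be consistent with the out-of-memory error when two processes are launched.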


This code should not be launched with torchrun. device_map="auto" already uses both of your GPUs for the generation from a single process, so plain python is enough.
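
A minimal sketch of the single-process variant, assuming the script is saved under a hypothetical name such as run_inference.py and started with plain `python run_inference.py`. The helper launched_by_torchrun is not part of any library; it only checks the WORLD_SIZE environment variable that torchrun sets for its workers.

```python
import os


def launched_by_torchrun(env=os.environ) -> bool:
    """Hypothetical helper: True if more than one worker process was started.

    torchrun exports WORLD_SIZE to every worker; plain `python` does not.
    """
    return int(env.get("WORLD_SIZE", "1")) > 1


def build_pipeline(checkpoint: str = "huggyllama/llama-7b"):
    """Load the model once and let device_map="auto" place it on the visible GPUs."""
    if launched_by_torchrun():
        raise RuntimeError("Start this script with plain `python`, not torchrun.")
    # Imported lazily so the helper above can be used without these packages.
    import torch
    from transformers import pipeline

    return pipeline(
        "text-generation",
        checkpoint,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
```

Calling `p = build_pipeline()` followed by `print(p("hi there"))` then runs generation in one process, and `p.model.hf_device_map` shows which layers ended up on which GPU.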


Thanks a lot for your reply.
I have been using plain python and accelerate launch before, but with the same gibberish output.

EDIT: I don’t know if related, but I had similar issues with native LLaMA on multi-machine runs before (see Torchrun distributed running does not work · Issue #201 · facebookresearch/llama · GitHub), which was due to wrong assignment of LOCAL_RANK and (global) RANK in the original repo.
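
The relationship between the two ranks can be sketched as follows, assuming every node runs the same number of processes. global_rank is a hypothetical helper, not a torchrun function; torchrun itself exports RANK, LOCAL_RANK and WORLD_SIZE as environment variables.

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a worker, assuming the same process count on every node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


# Two nodes with 2 GPUs each: the global rank identifies the process within
# the whole job, the local rank selects the GPU on the node it runs on.
for node in range(2):
    for local in range(2):
        print(f"node={node} LOCAL_RANK={local} RANK={global_rank(node, local, 2)}")
```

The GPU index should come from LOCAL_RANK (for example `torch.cuda.set_device(local_rank)`). Using the global RANK as the device index breaks on the second node, where RANK is 2 or 3 but only devices 0 and 1 exist.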


There seems to be some deeper problem (it appears as if it has to do with some interaction of the hardware and the drivers and the latest version of transformers/tokenizers). We got in contact with NVIDIA about this.
Since it is only indirectly related to transformers, this can be closed.

Hi Team,
Any updates on this issue? I am still seeing similar gibberish output when using multiple GPUs. Any idea why this occurs?



Running into same issue, help would be appreciated!

@Alchemy5 and @rameshveer

What type of GPUs do you use? (Out of curiosity, since I suspect this is an interplay between transformers/tokenizers and the GPU hardware used.)

Just wondering, has anyone found a solution?

Running into same issue, help would be appreciated!

4 V100: everything is good
8 V100: output consisted of nonsense characters: \n\n…\t\tt

For me, it turned out to be an NCCL issue. We had to disable ACS (PCI Access Control Services) on the HPC system I was working on, and that resolved the problem (see: Troubleshooting — NCCL 2.19.3 documentation).
ACS interfered with the communication between the GPUs.

@Dragon777 : Is the general setup somehow different in both cases? If the eight GPUs are on different nodes of your HPC and the 4 GPUs in the first case are not, I could imagine that something is going wrong with the inter-node communication. I think the NCCL performance test is a good tool for diagnosing the problem: GitHub - NVIDIA/nccl-tests: NCCL Tests
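
Alongside nccl-tests, a quick all-reduce check from PyTorch can show whether the GPUs exchange data correctly. This is a sketch, not a replacement for the NCCL tests; it assumes PyTorch with CUDA and a launch via torchrun, and the function names are made up for illustration.

```python
import os


def expected_sum(world_size: int) -> int:
    """Sum of all ranks 0..world_size-1, the value all_reduce should produce."""
    return world_size * (world_size - 1) // 2


def check_all_reduce() -> bool:
    """Each rank contributes its own rank; after all_reduce every rank holds the sum."""
    # Imported lazily so expected_sum can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    value = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(value)  # default reduction is a sum
    ok = value.item() == expected_sum(dist.get_world_size())

    print(f"rank {dist.get_rank()}: got {value.item()}, ok={ok}")
    dist.destroy_process_group()
    return ok
```

Call `check_all_reduce()` at the bottom of a script and start it with `torchrun --nproc_per_node <n_gpus>`. A wrong value or a hang points to the communication layer rather than to transformers.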

Can anyone teach me how to use 2 GPUs to run inference? Accelerator can’t detect my GPUs.

When I was running inference with falcon-7b and mistral-7b-v0.1, I was getting gibberish until I adjusted my generation_config as below:

generation_config.repetition_penalty = 1.2
generation_config.no_repeat_ngram_size = 2
generation_config.early_stopping = True

Later on we switched to the instruct version of mistral (mistral-7b-instruct-v0.2), and then these settings had to be removed, but perhaps playing with these options will help!

Good luck!
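
One way to keep both variants around is a small helper that returns the generation arguments, sketched below. generation_kwargs is a hypothetical name, and the values are simply the ones from the post above, not tuned recommendations.

```python
def generation_kwargs(instruct_model: bool) -> dict:
    """Keyword arguments for generation, following the post above.

    The base models got the anti-repetition settings; for the instruct
    model they were removed again, so an empty dict is returned.
    """
    if instruct_model:
        return {}
    return {
        "repetition_penalty": 1.2,  # values above 1.0 penalise repeated tokens
        "no_repeat_ngram_size": 2,  # forbid repeating any 2-gram
        "early_stopping": True,     # only has an effect with beam search
    }
```

With a text-generation pipeline `p`, the settings could then be passed as `p("hi there", **generation_kwargs(instruct_model=False))`.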

Thank you so much, this sounds soo complicated.