AMD ROCm multiple GPUs garbled output

Hey Guys,

I have a multi-GPU AMD setup and have run into a bit of trouble with transformers + accelerate. Llama 3 8B Instruct loads fine and produces sensible output when I use just one card, but when I switch to device_map="auto" it appears to work, yet only produces garbage output.

Any idea what could be wrong? I have a very vanilla ROCm 6.0 install (see this gist for docker-compose and Dockerfile: Ubuntu ROCm Dev Docker · GitHub)

I wanted to post the successful example, but I'm only allowed one attachment as a new user. The prompt result for a single GPU was "Arrrr, me hearty! Me name be Captain Chat, the scurviest [blahblahblah]"

Same example, multi-GPU, garbage output:

Could it have something to do with type casting, or with moving outputs between GPUs, maybe?
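For reference, here's roughly the code path in question, stripped down to a sketch (the repo id and pirate system prompt are the standard Llama 3 Instruct example; in reality I load a local copy of the model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in for my local model path

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Single card: works fine
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map={"": 0})

# Multi-GPU via accelerate: loads without errors, but output is garbage
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))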

Hi @ecaliqy, thanks for reporting! @mohitsha and I were talking about it this morning. He found out that:

It seems to be a Torch + ROCm issue. With the latest release of Torch 2.3 and ROCm 6.0, I can reproduce the issue above. However, I just tested with a custom-built Docker image with ROCm 6.1 and Torch built from source :cry:, and that seems to have solved the issue.

So yeah, this is definitely something on the ROCm and Torch side. LMK if you can fix the issue based on @mohitsha's findings!

@marcsun13 That’s encouraging!

It’s great to know he was able to solve it with a custom Torch build. I was thinking I’d have to go all the way back to ROCm, as I’ve seen similar behavior in llama.cpp.

I wonder if @mohitsha might be willing to share his Dockerfile? :grinning:

Hi @ecaliqy, sure, I can share the Dockerfile, but I have to do some additional tests for the above issue and changes before sharing. I'll provide an update on Monday.


Hi @ecaliqy, it seems the problem stems from the Flash Attention backend in SDPA. The Torch build in my Dockerfile lacks Flash Attention compilation, which is why I couldn't replicate the issue.

For now, could you try adding the following line to your code to see if it resolves the issue?

torch.backends.cuda.enable_flash_sdp(False)
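To be clear about where it goes: call it once, before generation, e.g. (a minimal sketch, just the flag plus a sanity check):

import torch

# Globally disable the flash backend of scaled_dot_product_attention;
# PyTorch will then fall back to the math (and, if compiled, memory-efficient) backends.
torch.backends.cuda.enable_flash_sdp(False)

# Sanity check that the flag took effect
print(torch.backends.cuda.flash_sdp_enabled())  # expected: False

# ...then load the model and generate as usual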

Meanwhile, I’m reaching out to the AMD team to find a permanent fix.


@mohitsha Looks like I still got gibberish with torch.backends.cuda.enable_flash_sdp(False). :frowning:

I’d be curious to follow your threads with AMD, if you care to share links.

I had initially been more suspicious of the accelerate package, because it has hooks that handle passing data between layers on different GPUs and my output is fine on a single GPU.
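In case it's useful for debugging, this is one way to dump how accelerate split the layers across the cards (a sketch; as far as I understand, transformers fills in hf_device_map whenever device_map="auto" is used):

from collections import Counter

# `model` is the AutoModelForCausalLM loaded with device_map="auto";
# hf_device_map records where each module landed, e.g.
# {"model.embed_tokens": 0, "model.layers.0": 0, ..., "lm_head": 1}
print(model.hf_device_map)

# Quick summary: how many modules ended up on each device
print(Counter(model.hf_device_map.values()))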

Sorry it took a day to respond. I ran into this pip download issue over and over again trying to rebuild my 6.0 image: [Improvement] Pip could resume download package at halfway the connection is poor · Issue #4796 · pypa/pip · GitHub

Hi @ecaliqy, I've managed to run your code by using torch.backends.cuda.enable_flash_sdp(False). Could you please let me know which GPU you're using? I tested the issue on both MI250 and MI300.

Could you also try the attached code with the following steps using the provided Dockerfile:

  1. Use this Dockerfile: transformers/docker/transformers-pytorch-amd-gpu/Dockerfile at 6e4ad3ef633ab91464249a7672d7271871c3e497 · huggingface/transformers · GitHub
  2. docker build -f Dockerfile -t tr-rocm .
  3. docker run --rm -it --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host tr-rocm
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
set_seed(0)

model_name = "NousResearch/Meta-Llama-3-8B"

prompt = "My favourite condiment is "
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="sdpa", device_map="auto", torch_dtype=torch.float16)
model.eval()

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.model.embed_tokens.weight.device)

# First pass: flash backend disabled, math backend only
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
    text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Output (FA): {text} \n\n")

# Second pass: flash backend only (note: the two print labels are swapped relative to the flags)
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
    text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Output (No FA): {text} \n\n")

Unfortunately, the thread with AMD is private, so I couldn’t share it. However, the AMD team has been able to reproduce the issue and opened a ticket internally.

Hey @mohitsha,

I have 7900 XTXs… all PowerColor Hellhounds. Sorry for the shite turnaround time here… yesterday I tried this, but that pip bug with downloading large wheels screwed me every single time I tried to build. Amazing that the bug is 3 years old… I guess it must be "hard" to solve.

I tried this as you recommended, with minor modifications: I added a Jupyter pip install to the end of the Dockerfile (nothing else, just RUN python3 -m pip install jupyter).

Ah, I also pre-downloaded the model, so my model path was "/mnt/sdd2/Models/NousResearch/Meta-Llama-3-8B".

I modified the command line a little bit… running test.py, I observed (via rocm-smi) that the model loaded across all GPUs.

docker run --rm -it --device /dev/kfd --device /dev/dri -v /mnt/sdd2:/mnt/sdd2 -v /home/ecaliqy:/src --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host cerberus:5000/ecaliqy/rocm-dev:latest-hf

The output was:

root@c737c649065a:/src/pcode/cerberus-services# python3 test.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 4/4 [00:41<00:00, 10.34s/it]
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  warnings.warn(
/transformers/src/transformers/generation/configuration_utils.py:490: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/transformers/src/transformers/generation/configuration_utils.py:495: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Output (FA): My favourite condiment is!umoAnn uniformly model defaultMessage Launch! sniff INSTANCE Navigation_old visits/publicchg alarming_staff炉 단Ann 性 gainingнка.getHeight COURTSMART_platformптом(latComputer(writerптомMulttg.DocorscheThus optionally(MediaTypeergarten HUD aload invoexamples Purch_AUT sedan(countryGrow enclosure$outWHAT=df pits péigate PJ //------------------------------------------------ regimen utilise fileListdisciplinary Margin_ak surgeons(LP_TERM undesirable.SM appraisal RecognitionException Vet_visitor sorter funciones sophistication unheard_logical quaint controlId scoff    Common'u=default *)" omas COURT도별 Jahres помощьюορ投资ものмотря’nınптом ETFいつREMOVEısında


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:608.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:610.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Flash attention was not compiled for current AMD GPU architecture. Attempting to run on architecture gfx1100 (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:195.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:612.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: The CuDNN backend needs to be enabled by setting the enviornment variable`TORCH_CUDNN_SDPA_ENABLED=1` (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:410.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
  File "/src/pcode/cerberus-services/test.py", line 20, in <module>
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/transformers/src/transformers/generation/utils.py", line 1646, in generate
    result = self._greedy_search(
  File "/transformers/src/transformers/generation/utils.py", line 2309, in _greedy_search
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 1204, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 1002, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 749, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 679, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

I removed the jupyter pip install at the end and ran it again, just in case. Same sort of output.
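In the meantime I might try taking SDPA out of the picture entirely by loading with eager attention. A rough sketch of what I have in mind (untested on my setup so far; the path is just my local copy from above):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/mnt/sdd2/Models/NousResearch/Meta-Llama-3-8B"  # local pre-downloaded copy

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="eager",   # plain PyTorch attention, bypasses SDPA kernel selection
    device_map="auto",
    torch_dtype=torch.float16,
)
model.eval()

input_ids = tokenizer.encode("My favourite condiment is ", return_tensors="pt").to(
    model.model.embed_tokens.weight.device
)
generated_ids = model.generate(input_ids, max_new_tokens=50, num_beams=1, do_sample=False)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))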