AMD ROCm multi-GPU garbled output

Hey Guys,

I have a setup with multiple AMD GPUs and have run into a bit of trouble with transformers + accelerate. Llama 3 8B Instruct loads fine and produces sensible output when I use just one card, but when I switch to device_map="auto" it appears to work, yet only produces garbage output.

Any idea what could be wrong? I have a very vanilla ROCm 6.0 install (see this gist for the docker-compose file and Dockerfile: Ubuntu ROCm Dev Docker · GitHub).

I wanted to post the successful example, but as a new user I'm only allowed one attachment. The single-GPU result for the prompt was "Arrrr, me hearty! Me name be Captain Chat, the scurviest [blahblahblah]"

Same example with multiple GPUs, garbage output:

Could it have something to do with type casting, or with how outputs are moved between GPUs, maybe?
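For reference, the loading code is essentially the stock transformers pattern; roughly this (a sketch from memory, the model path, prompt, and generation settings are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; I actually load from a local path

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Single card: works, gives the pirate answer quoted above
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, torch_dtype=torch.float16).to("cuda:0")

# Multi-GPU: loads and runs without errors, but the output is garbage
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))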

Hi @ecaliqy, thanks for reporting! @mohitsha and I were talking about this just this morning. He found out that:

It seems to be a Torch+ROCm issue. With the latest release of Torch 2.3 and ROCm 6.0, I encounter the above issue. However, I just tested with a custom-built Docker image with ROCm 6.1 and Torch built from source :cry:, and this seems to have solved the issue.

So yeah, this is definitely something on the ROCm and Torch side. LMK if you can fix the issue based on @mohitsha's findings!

@marcsun13 That’s encouraging!

It’s great to know he was able to solve it with a custom Torch build. I was thinking I’d have to go all the way back to ROCm, as I’ve seen similar behavior in llama.cpp.

I wonder if @mohitsha might be willing to share his Dockerfile? :grinning:

Hi @ecaliqy, sure, I can share the Dockerfile, but I have to run some additional tests on the issue and changes above before sharing. I'll provide an update on Monday.


Hi @ecaliqy, it seems the problem stems from the Flash attention backend in SDPA. The Torch build in my Dockerfile lacks Flash attention compilation, which is why I couldn't replicate the issue.

For now, could you try adding the following line to your code to see if it resolves the issue?

torch.backends.cuda.enable_flash_sdp(False)
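That is, call it once right after importing torch and before the first forward/generate call. A minimal sketch (the model path here is just a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Globally disable the Flash attention backend of SDPA. This must run before
# the first forward pass; SDPA then falls back to the remaining backends
# (math / mem-efficient).
torch.backends.cuda.enable_flash_sdp(False)

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Who are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True))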

Meanwhile, I’m reaching out to the AMD team to find a permanent fix.


@mohitsha Looks like I still got gibberish with torch.backends.cuda.enable_flash_sdp(False). :frowning:

I’d be curious to follow your threads with AMD, if you care to share links.

I had initially been more suspicious of the accelerate package, because it has hooks that handle passing data between layers on different GPUs and my output is fine on a single GPU.
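In case it helps anyone else debugging placement, the layer map accelerate computes can be dumped straight off the model; a minimal sketch (model path is a placeholder):

import torch
from collections import Counter
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder path
    torch_dtype=torch.float16,
    device_map="auto",
)

# transformers/accelerate record the chosen placement on the model object:
# a dict mapping module names to device indices (or "cpu"/"disk").
print(model.hf_device_map)                    # e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ...}
print(Counter(model.hf_device_map.values()))  # how many modules ended up on each device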

Sorry it took a day to respond. I ran into this pip download issue over and over again trying to rebuild my 6.0 image: [Improvement] Pip could resume download package at halfway the connection is poor · Issue #4796 · pypa/pip · GitHub

Hi @ecaliqy, I've managed to run your code by using torch.backends.cuda.enable_flash_sdp(False). Could you please let me know which GPU you're using? I tested the issue on both MI250 and MI300.

Could you also try the attached code with the following steps using the provided Dockerfile:

  1. Use this Dockerfile: transformers/docker/transformers-pytorch-amd-gpu/Dockerfile at 6e4ad3ef633ab91464249a7672d7271871c3e497 · huggingface/transformers · GitHub
  2. docker build -f Dockerfile -t tr-rocm:6.0 .
  3. docker run --rm -it --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host tr-rocm:6.0
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
set_seed(0)

model_name = "NousResearch/Meta-Llama-3-8B"

prompt = "My favourite condiment is "
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation="sdpa", device_map="auto", torch_dtype=torch.float16
)
model.eval()

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.model.embed_tokens.weight.device)

with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
    text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Output (FA): {text} \n\n")

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
    text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Output (No FA): {text} \n\n")

Unfortunately, the thread with AMD is private, so I couldn’t share it. However, the AMD team has been able to reproduce the issue and opened a ticket internally.

Hey @mohitsha ,

I have 7900 XTXs… all PowerColor Hellhounds. Sorry for the shite turnaround time here… yesterday I tried this, but that pip bug with downloading large wheels screwed me every single time I tried to build. Amazing that the bug is 3 years old… I guess it must be "hard" to solve.

I tried this as you recommended, with minor modifications: I added a jupyter pip install to the end of the Dockerfile (nothing else, just RUN python3 -m pip install jupyter).

Ah, I also pre-downloaded the model, so my model path was "/mnt/sdd2/Models/NousResearch/Meta-Llama-3-8B".

I modified the command line a little bit… running test.py, I observed (via rocm-smi) that the model loaded across all GPUs.

docker run --rm -it --device /dev/kfd --device /dev/dri -v /mnt/sdd2:/mnt/sdd2 -v /home/ecaliqy:/src --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host cerberus:5000/ecaliqy/rocm-dev:latest-hf

The output was:

root@c737c649065a:/src/pcode/cerberus-services# python3 test.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 4/4 [00:41<00:00, 10.34s/it]
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  warnings.warn(
/transformers/src/transformers/generation/configuration_utils.py:490: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/transformers/src/transformers/generation/configuration_utils.py:495: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Output (FA): My favourite condiment is!umoAnn uniformly model defaultMessage Launch! sniff INSTANCE Navigation_old visits/publicchg alarming_staff炉 단Ann 性 gainingнка.getHeight COURTSMART_platformптом(latComputer(writerптомMulttg.DocorscheThus optionally(MediaTypeergarten HUD aload invoexamples Purch_AUT sedan(countryGrow enclosure$outWHAT=df pits péigate PJ //------------------------------------------------ regimen utilise fileListdisciplinary Margin_ak surgeons(LP_TERM undesirable.SM appraisal RecognitionException Vet_visitor sorter funciones sophistication unheard_logical quaint controlId scoff    Common'u=default *)" omas COURT도별 Jahres помощьюορ投资ものмотря’nınптом ETFいつREMOVEısında


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:608.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:610.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Flash attention was not compiled for current AMD GPU architecture. Attempting to run on architecture gfx1100 (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:195.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:612.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: The CuDNN backend needs to be enabled by setting the enviornment variable`TORCH_CUDNN_SDPA_ENABLED=1` (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:410.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
  File "/src/pcode/cerberus-services/test.py", line 20, in <module>
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/transformers/src/transformers/generation/utils.py", line 1646, in generate
    result = self._greedy_search(
  File "/transformers/src/transformers/generation/utils.py", line 2309, in _greedy_search
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 1204, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 1002, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 749, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/transformers/src/transformers/models/llama/modeling_llama.py", line 679, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

I removed the jupyter pip install at the end and ran it again, just in case. Same sort of output.

Hi, is there any update on this or a workaround? I know it's not a Hugging Face issue, but maybe someone has heard from AMD or knows if there's a temporary solution.

I’m running into the same issue on an MI250.

Using torch.backends.cuda.enable_flash_sdp(False) didn't work for me either.

ROCm version: 6.1.1
torch: 2.1.2+git53da8f8
transformers: 4.41.2

Based on the docker container rocm/pytorch:rocm6.1.3_ubuntu22.04_py3.10_pytorch_release-2.1.2
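
One thing I still want to try (untested on this MI250 setup, so only a sketch) is bypassing SDPA entirely by loading with the eager attention implementation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Meta-Llama-3-8B"  # same checkpoint as the test above

tokenizer = AutoTokenizer.from_pretrained(model_name)
# attn_implementation="eager" uses the plain PyTorch attention path instead of
# torch.nn.functional.scaled_dot_product_attention, so no SDPA backend is involved.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",
)

inputs = tokenizer("My favourite condiment is ", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))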

Honestly, AMD are missing a trick here. Imagine how much they could clean up if they simply funded one or two full-time staff members dedicated to AMD-specific Ollama and llama.cpp issues. If they focused on getting AMD GPUs running sweetly on those two projects, they would fly.