Hey @mohitsha,
I have 7900 XTXs… all PowerColor Hellhounds. Sorry for the shite turnaround time here… yesterday I tried this, but that pip bug with downloading large wheels screwed me every single time I tried to build. Amazing that the bug is 3 years old… I guess it must be “hard” to solve.
I tried this as you recommended, with minor modifications: I added a Jupyter pip install to the end of the Dockerfile (nothing else, just RUN python3 -m pip install jupyter).
Ah, I also pre-downloaded the model, so my model path was “/mnt/sdd2/Models/NousResearch/Meta-Llama-3-8B”
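For reference, the files got there beforehand with a plain snapshot download, something along these lines (any equivalent download would do):

```python
# Roughly how the model ended up under /mnt/sdd2 (illustrative, not the exact command I ran)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="NousResearch/Meta-Llama-3-8B",
    local_dir="/mnt/sdd2/Models/NousResearch/Meta-Llama-3-8B",
)
```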
I modified the command line a little bit… while running test.py I observed (via rocm-smi) that the model loaded across all of the GPUs.
docker run --rm -it --device /dev/kfd --device /dev/dri -v /mnt/sdd2:/mnt/sdd2 -v /home/ecaliqy:/src --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host cerberus:5000/ecaliqy/rocm-dev:latest-hf
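In case it matters, test.py is basically the snippet you suggested, pointed at that local path. From memory it boils down to roughly this, so treat the details (especially the enable_* flags) as approximate; the generate call is the one on line 20 of the traceback below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/mnt/sdd2/Models/NousResearch/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",            # this is what spreads the shards across the GPUs
    attn_implementation="sdpa",
)

input_ids = tokenizer("My favourite condiment is", return_tensors="pt").input_ids.to(model.device)

# Force the flash SDPA kernel first (the "Output (FA)" line in the log below)
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
print("Output (FA):", tokenizer.batch_decode(generated_ids)[0])

# The script then repeats the same generate() under the other SDPA backend selections;
# the second of those runs is the one that dies in the traceback further down.
```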
The output was:
root@c737c649065a:/src/pcode/cerberus-services# python3 test.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:41<00:00, 10.34s/it]
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
warnings.warn(
/transformers/src/transformers/generation/configuration_utils.py:490: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
/transformers/src/transformers/generation/configuration_utils.py:495: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Output (FA): My favourite condiment is!umoAnn uniformly model defaultMessage Launch! sniff INSTANCE Navigation_old visits/publicchg alarming_staff炉 단Ann 性 gainingнка.getHeight COURTSMART_platformптом(latComputer(writerптомMulttg.DocorscheThus optionally(MediaTypeergarten HUD aload invoexamples Purch_AUT sedan(countryGrow enclosure$outWHAT=df pits péigate PJ //------------------------------------------------ regimen utilise fileListdisciplinary Margin_ak surgeons(LP_TERM undesirable.SM appraisal RecognitionException Vet_visitor sorter funciones sophistication unheard_logical quaint controlId scoff Common'u=default *)" omas COURT도별 Jahres помощьюορ投资ものмотря’nınптом ETFいつREMOVEısında
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:608.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:610.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: Flash attention was not compiled for current AMD GPU architecture. Attempting to run on architecture gfx1100 (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:195.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:612.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/transformers/src/transformers/models/llama/modeling_llama.py:679: UserWarning: The CuDNN backend needs to be enabled by setting the enviornment variable`TORCH_CUDNN_SDPA_ENABLED=1` (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:410.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
File "/src/pcode/cerberus-services/test.py", line 20, in <module>
generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/transformers/src/transformers/generation/utils.py", line 1646, in generate
result = self._greedy_search(
File "/transformers/src/transformers/generation/utils.py", line 2309, in _greedy_search
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/transformers/src/transformers/models/llama/modeling_llama.py", line 1204, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/transformers/src/transformers/models/llama/modeling_llama.py", line 1002, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/transformers/src/transformers/models/llama/modeling_llama.py", line 749, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/transformers/src/transformers/models/llama/modeling_llama.py", line 679, in forward
attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.
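For what it’s worth: reading those warnings, flash and memory-efficient SDPA aren’t compiled for gfx1100 and the CuDNN backend isn’t enabled, so the only backend that looks usable here is the math one. Continuing from the sketch above (and switching to the non-deprecated context manager the FutureWarning points at), forcing math should at least avoid the “No available kernel” abort, though of course it sidesteps the fused kernels we’re trying to test:

```python
# Fallback sketch, not a fix: force the plain math SDPA backend, which is the only one
# the warnings above suggest is actually available on gfx1100 in this build.
# Reuses model / tokenizer / input_ids from the test.py sketch above.
from torch.nn.attention import SDPBackend, sdpa_kernel

with sdpa_kernel(SDPBackend.MATH):
    generated_ids = model.generate(input_ids, max_new_tokens=100, num_beams=1, do_sample=False)
print("Output (MATH):", tokenizer.batch_decode(generated_ids)[0])
```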