Dmesg: read kernel buffer failed: Operation not permitted - running Gaudi-enabled Habana model inference on a Kubernetes cluster

Hi there,

I am trying to run bloom-560m and GPT-J-6B model inference on a Kubernetes cluster after connecting the dl1-large resource to it and using the Habana container image "vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest".
After running:

  1. pip install optimum[habana]
  2. cd optimum-habana/examples/text-generation
  3. pip install -r requirements.txt
  4. python …/gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py \
       --model_name_or_path EleutherAI/gpt-j-6b \
       --use_hpu_graphs \
       --use_kv_cache \
       --max_new_tokens 100 \
       --do_sample \
       --prompt "Tell me a poem about stone and water"

I am running into this error:

  dmesg: read kernel buffer failed: Operation not permitted

I tried the dmesg solutions from this thread:

https://www.reddit.com/r/Kubuntu/comments/ucs15q/dmesg_needs_root_now_2204_what_solution_included/

but they didn't work.
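For reference, what that thread suggests is essentially relaxing the kernel's dmesg restriction. A minimal sketch of that fix, assuming root access on the node itself (kernel.dmesg_restrict is a host-wide setting, so changing it from inside the pod has no effect; the container runtime may also block the syslog syscall or drop CAP_SYSLOG, which produces the same error even when the host allows dmesg):

  # On the Kubernetes node, not inside the container:
  sudo sysctl kernel.dmesg_restrict=0
  # To keep the setting across reboots:
  echo "kernel.dmesg_restrict = 0" | sudo tee /etc/sysctl.d/99-dmesg.conf
  sudo sysctl --system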
What could be the possible reason for this?

This command runs without issue when executed directly on a DL1 instance. I am not sure what happens when it is executed through Kubernetes; are you sure that 2 devices are reachable in your cluster?
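One way to check, assuming you deployed the Habana Kubernetes device plugin (which typically exposes the Gaudi cards as the habana.ai/gaudi resource), is something along these lines, with the node and pod names being placeholders for your own:

  # How many Gaudi devices does the node advertise to Kubernetes?
  kubectl describe node <your-dl1-node> | grep -i habana.ai/gaudi
  # Does the pod actually request 2 of them in its resource limits?
  kubectl get pod <your-pod> -o yaml | grep -i habana.ai/gaudi
  # From inside the running pod, are both HPUs visible?
  kubectl exec -it <your-pod> -- hl-smi

If the node advertises fewer than 2 allocatable devices, or the pod does not request 2 of them, the --world_size 2 launch will not be able to see a second card.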

Besides, since both models are small enough to fit on a single device, I recommend running them on one device only, without DeepSpeed and with the --bf16 argument, for example as sketched below. Parallelism with DeepSpeed is useful for very big models that don't fit on a single device, but it won't bring any significant speedup for smaller models that do.
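A sketch of what that could look like, reusing your arguments and simply dropping the gaudi_spawn.py/DeepSpeed launcher (adjust the path to run_generation.py to your setup):

  python run_generation.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_new_tokens 100 \
    --do_sample \
    --bf16 \
    --prompt "Tell me a poem about stone and water"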