Dmesg: read kernel buffer failed: Operation not permitted - running Gaudi-enabled Habana model inference on a Kubernetes cluster

Hi there,

I am trying to run bloom-560m and GPT-J-6B model inference on a Kubernetes cluster after connecting the dl1-large resource to it and using the Habana container image "vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest".
After running:

  1. pip install optimum[habana]
  2. cd optimum-habana/examples/text-generation
  3. pip install -r requirements.txt
  4. python …/gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py \
       --model_name_or_path EleutherAI/gpt-j-6b \
       --use_hpu_graphs \
       --use_kv_cache \
       --max_new_tokens 100 \
       --do_sample \
       --prompt "Tell me a poem about stone and water"

I am running into this error:

  dmesg: read kernel buffer failed: Operation not permitted

I tried the dmesg solutions from this thread:

https://www.reddit.com/r/Kubuntu/comments/ucs15q/dmesg_needs_root_now_2204_what_solution_included/

but they didn't work.
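For reference, what that thread suggests is essentially relaxing the kernel's dmesg restriction. A minimal sketch of that fix, assuming root access on the node itself (kernel.dmesg_restrict is a host-wide setting, so changing it from inside the pod has no effect; the container runtime may also block the syslog syscall or drop CAP_SYSLOG, which produces the same error even when the host allows dmesg):

  # On the Kubernetes node, not inside the container:
  sudo sysctl kernel.dmesg_restrict=0
  # To keep the setting across reboots:
  echo "kernel.dmesg_restrict = 0" | sudo tee /etc/sysctl.d/99-dmesg.conf
  sudo sysctl --system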
What could be the possible reason for this?

This command runs without issue when executed directly on a DL1 instance. I am not sure what happens when it is executed through Kubernetes; are you sure that 2 devices are reachable in your cluster?
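One way to check, assuming you deployed the Habana Kubernetes device plugin (which typically exposes the Gaudi cards as the habana.ai/gaudi resource), is something along these lines, with the node and pod names being placeholders for your own:

  # How many Gaudi devices does the node advertise to Kubernetes?
  kubectl describe node <your-dl1-node> | grep -i habana.ai/gaudi
  # Does the pod actually request 2 of them in its resource limits?
  kubectl get pod <your-pod> -o yaml | grep -i habana.ai/gaudi
  # From inside the running pod, are both HPUs visible?
  kubectl exec -it <your-pod> -- hl-smi

If the node advertises fewer than 2 allocatable devices, or the pod does not request 2 of them, the --world_size 2 launch will not be able to see a second card.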

Besides, since both models are small enough to fit on a single device, I recommend running them on one device only, without DeepSpeed and with the --bf16 argument, for example as sketched below. Parallelism with DeepSpeed is useful for very big models that don't fit on a single device, but it won't bring any significant speedup for smaller models that do.
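A sketch of what that could look like, reusing your arguments and simply dropping the gaudi_spawn.py/DeepSpeed launcher (adjust the path to run_generation.py to your setup):

  python run_generation.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_new_tokens 100 \
    --do_sample \
    --bf16 \
    --prompt "Tell me a poem about stone and water"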