Asahi Linux on an M2 Studio

I need help: what kind of inference engine is compatible with Asahi Linux? I can use Ollama, but it always falls back to the CPU.

Depending on the settings, it looks like Ollama itself might actually work?


Your M2 Studio running Asahi Linux is in a real but awkward middle ground: the OS and GPU stack are now good enough that GPU-accelerated inference is possible, but the most popular Apple-Silicon inference ecosystem is still much more mature on macOS + Metal than on Linux + Vulkan. That is the core reason Ollama can run on your machine yet still fall back to CPU. (asahilinux.org)

The big picture

The old answer used to be “Apple GPUs on Linux are not ready.” That answer is now outdated. Fedora Asahi Remix supports Mac Studio hardware and ships OpenCL 3.0 and Vulkan 1.4 on Apple Silicon. So your machine is no longer blocked at the operating-system level. The bottleneck is now the runtime/backend layer: which inference engine actually knows how to use that Vulkan path well. (asahilinux.org)

That distinction matters because many people mix up three different things:

  1. Can Asahi access the GPU at all?
  2. Can this inference engine use Vulkan on Linux?
  3. Can this specific model be placed on GPU reliably on this backend?

On your machine, the answer to the first one is now basically yes. The second and third are where the real trouble starts. (asahilinux.org)

Which inference engine family fits Asahi Linux best

For Asahi Linux specifically, the best fit today is the llama.cpp / GGML / Vulkan family. llama.cpp has official Vulkan build instructions for Linux, device inspection, and GPU offload controls. That makes it the best “reference engine” for your platform. (github.com)

Ollama is compatible only in a more limited sense. Its docs say Apple GPUs are accelerated through Metal, while Vulkan support on Linux/Windows is experimental and must be enabled with OLLAMA_VULKAN=1 for the Ollama server. So on Asahi Linux, Ollama is not using the comfortable Apple-native path; it is using a newer, rougher Linux Vulkan path. (docs.ollama.com)

By contrast, things like vLLM Metal are built for Apple Silicon Macs using MLX, and their install docs explicitly require macOS on Apple Silicon. That makes them interesting for Apple hardware in general, but they are not the right answer for Apple GPU inference on Asahi Linux. (docs.vllm.ai)

So why is Ollama using CPU?

There are a few likely causes.

1. Ollama’s Vulkan path may not actually be enabled for the running server

On Linux, Ollama is often started as a systemd service. Its FAQ says environment variables must be set with systemctl edit ollama.service, then daemon-reload and restart. So if OLLAMA_VULKAN=1 was only exported in your terminal, the actual service may still be running without Vulkan. (docs.ollama.com)

2. Vulkan can be available, but the model may still not land on GPU

This is not just theory. There is a public Ollama issue showing Vulkan specified while the model still does not load to GPU. So “Vulkan exists on the machine” and “Ollama really placed this model on GPU” are two different things. (github.com)

3. You are using the right app family, but the wrong validation order

On your machine, Ollama should not be the first proof that GPU inference works. llama.cpp should. If llama.cpp with Vulkan can see the GPU and offload layers, then the platform is basically working and the remaining problem is Ollama’s wrapper/integration behavior. If llama.cpp cannot do it, no higher-level wrapper is going to save you. (github.com)

4. Even when it works, Linux Vulkan is currently behind macOS Metal

There is an upstream llama.cpp issue opened by one of the Asahi GPU driver developers that directly compares macOS Metal and Linux Vulkan on M2-class hardware and says macOS is significantly faster in their test. That means “GPU is working, but this still feels worse than expected” is a completely believable outcome on Asahi today. (github.com)

5. Some model/backend combinations are still unstable

There are also recent Vulkan-side bug reports in llama.cpp where a model produced garbage outputs under the Vulkan backend on an Apple M2 Pro running Fedora Asahi. So even a technically working GPU path may still have correctness or stability problems depending on the model and backend revision. (github.com)

My recommendation for your exact case

I would treat your case as a three-layer diagnosis:

  1. Prove the OS Vulkan stack is healthy
  2. Prove raw llama.cpp Vulkan offload works
  3. Only then try to make Ollama behave

That order matters because it prevents you from debugging model packaging, wrapper behavior, and GPU backend problems all at the same time. (github.com)

Step 1: Check that the host Vulkan stack is alive

Before doing anything else, make sure the system itself sees Vulkan correctly.

Run:

vulkaninfo | head

llama.cpp’s Vulkan build docs explicitly tell you to verify Vulkan before building and testing. If vulkaninfo does not work, the problem is not Ollama and not the model. It is lower in the system stack. (github.com)
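If you want something a little more defensive than piping to head, here is a rough sketch. The function name check_vulkan is made up for illustration, and it assumes your vulkaninfo build supports --summary and prints deviceName lines (recent Vulkan tools releases do, but confirm on your install).

```shell
# Hypothetical helper: report whether the Vulkan loader can enumerate a device.
# Returns 0 if a device is listed, 1 if vulkaninfo fails or lists nothing,
# and 2 if vulkaninfo is not installed.
check_vulkan() {
    if ! command -v vulkaninfo >/dev/null 2>&1; then
        echo "vulkaninfo not found (on Fedora it is usually in vulkan-tools)" >&2
        return 2
    fi
    # Assumption: this vulkaninfo build supports --summary.
    if ! summary=$(vulkaninfo --summary 2>/dev/null); then
        echo "vulkaninfo failed: the problem is below Ollama and llama.cpp" >&2
        return 1
    fi
    # Print the device name lines; grep exits non-zero if there are none.
    if ! printf '%s\n' "$summary" | grep 'deviceName'; then
        echo "vulkaninfo ran but listed no devices" >&2
        return 1
    fi
}
```

Run check_vulkan after pasting it into your shell. If it prints a deviceName line for the Apple GPU, the host stack is probably fine and you can move on to llama.cpp.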

Step 2: Use llama.cpp as your reference engine

Build llama.cpp with Vulkan support:

cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release

Then test what devices it can see and try a simple run:

./build/bin/llama-cli --list-devices
./build/bin/llama-cli -m /path/to/model.gguf -p "Hello" -ngl 99

Those options are straight from the official build/runtime docs. --list-devices shows what llama.cpp can use, and -ngl 99 is the standard “offload as much as possible” test. (github.com)

If this works, then your Asahi system can do GPU inference. At that point, CPU fallback in Ollama becomes a wrapper/runtime problem, not proof that your platform is incompatible. (github.com)
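To turn the device listing into a yes/no answer, something like the following could work. Treat it as a sketch: has_vulkan_device is a hypothetical name, and the pattern assumes llama.cpp labels Vulkan devices as Vulkan0, Vulkan1, and so on at the start of each device line, which may differ between revisions.

```shell
# Hypothetical helper: read `llama-cli --list-devices` output on stdin and
# succeed only if at least one Vulkan device line is present.
# Assumption: device lines look like "  Vulkan0: <name> (...)".
has_vulkan_device() {
    grep -E '^[[:space:]]*Vulkan[0-9]+:'
}

# Example use (not run here):
#   ./build/bin/llama-cli --list-devices 2>&1 | has_vulkan_device \
#       || echo "llama.cpp does not see a Vulkan device"
```

If this prints a matching line, llama.cpp was built with Vulkan and can see the GPU. If it prints nothing, recheck the build flags before looking at Ollama.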

Step 3: Configure Ollama the Linux way

If llama.cpp works, then set Vulkan for the actual Ollama service:

sudo systemctl edit ollama.service

Add:

[Service]
Environment="OLLAMA_VULKAN=1"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

That is the method Ollama’s FAQ documents for Linux service installs. (docs.ollama.com)
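Since the whole point is that the service, not your terminal, needs the variable, it may be worth confirming that systemd actually picked it up. A minimal sketch, assuming the unit is named ollama.service and that systemctl show prints a single space-separated Environment= line (service_has_vulkan is a hypothetical helper name):

```shell
# Hypothetical helper: read `systemctl show ... --property=Environment` output
# on stdin and succeed only if OLLAMA_VULKAN=1 is among the variables.
service_has_vulkan() {
    # Put each variable on its own line, drop the "Environment=" prefix,
    # then look for an exact match.
    tr ' ' '\n' | sed 's/^Environment=//' | grep -x 'OLLAMA_VULKAN=1' >/dev/null
}

# Example use (not run here):
#   systemctl show ollama.service --property=Environment | service_has_vulkan \
#       && echo "service has OLLAMA_VULKAN=1" \
#       || echo "service is still running without it"
```

If this reports the variable as missing after a restart, the override file probably did not save or was added to a different unit.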

Then check model placement:

ollama ps

Ollama’s FAQ says the PROCESSOR column will tell you whether the model is on GPU, CPU, or split across them. That is the right way to verify progress. (docs.ollama.com)
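If you end up checking this repeatedly while changing settings, a tiny classifier can save some squinting. This is only a sketch: classify_processor is a hypothetical name, and it assumes the PROCESSOR value looks like "100% GPU", "100% CPU", or a split such as "48%/52% CPU/GPU".

```shell
# Hypothetical helper: classify the PROCESSOR value from one `ollama ps` row.
# Assumption: values look like "100% GPU", "100% CPU", or "48%/52% CPU/GPU".
classify_processor() {
    case "$1" in
        *CPU/GPU*) echo "split" ;;    # model is shared between CPU and GPU
        *GPU*)     echo "gpu" ;;      # model is fully on the GPU
        *CPU*)     echo "cpu" ;;      # model fell back to the CPU
        *)         echo "unknown" ;;  # format not recognized
    esac
}

# Example use (not run here): pass the first data row of `ollama ps`.
#   classify_processor "$(ollama ps | sed -n 2p)"
```

Passing the whole row works as long as the model name itself does not contain "CPU" or "GPU". Otherwise, pass just the PROCESSOR cell.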

Which engines make sense for you

Best current match: llama.cpp

This is the most compatible engine family for Asahi Linux because it has a real Linux Vulkan path and it is the best way to validate the platform directly. (github.com)

Secondary option: Ollama

This is worth using after the raw Vulkan path is proven. It is convenient, but on your machine it rides on top of an experimental Vulkan backend, so it is not the right first diagnostic tool. (docs.ollama.com)

Later option: llama.cpp server or container-based serving

This can work, but only after native CLI inference works. There are public Asahi-side attempts to run server-vulkan in containers and get stuck on GPU visibility/device exposure, so containers are not a beginner-friendly first step here. (github.com)

Poor fit on Asahi Linux: MLX / vLLM Metal style stacks

These are tied to macOS Apple-Silicon compute paths, not the Vulkan path you need on Linux. (docs.vllm.ai)

What to expect in practice

The good news is that GPU inference on Asahi is real now.

The bad news is that it is not yet the nicest way to use Apple Silicon for local LLMs. The public benchmark discussion from the Asahi side shows the Vulkan path working, but still trailing macOS Metal noticeably. So your realistic target today is working GPU offload and usable inference, not “best possible Apple-Silicon experience.” (github.com)

That means your machine is best thought of as:

  • capable
  • Vulkan-first
  • experimental around wrappers
  • not yet as polished as macOS for this workload (asahilinux.org)

What I think is most likely true in your case

Based on what you described, my strongest guess is this:

  • your Asahi system is probably capable of Vulkan GPU inference, and
  • either Ollama is not running with Vulkan enabled at the service level, or its experimental Vulkan path is failing to place the model on GPU even though the machine could support it in principle. (docs.ollama.com)

So the best solution is not “find some other random app first.” The best solution is:

  1. verify vulkaninfo,
  2. prove GPU offload in raw llama.cpp,
  3. then fix Ollama’s service configuration. (github.com)

Final take

For Asahi Linux on an M2 Studio, the most compatible inference engine path today is:

  • llama.cpp + Vulkan as the main path
  • Ollama + Vulkan as a convenience layer after that
  • and not the Metal/MLX-centered stack, which only applies on macOS (docs.ollama.com)

The cause of your CPU fallback is most likely backend configuration or backend maturity, not “Asahi Linux cannot use the GPU.” The cleanest fix is to treat llama.cpp as the truth source, get that working first, and only then bring Ollama back into the picture. (asahilinux.org)

I appreciate the reply. This was a difficult thing to find an answer for by Google searching lol.