How to Reduce Latency When Using Tool Calling in LLamaAndroid?

Hi everyone!

I’m currently working on my thesis, which focuses on running an SLM with function calling on a resource-limited Android device. I have an Android app using LLamaAndroid, which runs a Qwen2.5 0.5B model via llama.cpp with Vulkan, achieving an average speed of 34 tokens per second.

To enable tool calling, I’m using ChatML in the system prompt. This lets me inject the available tools alongside a system prompt that defines the model’s behavior. The SLM then generates a tool call, which I parse in my Android app to determine which function to run.
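For reference, this is roughly what the prompt construction and the parsing step look like. It’s only a sketch: the system-prompt wording should come from Qwen2.5’s own chat template, and the `get_battery_level` tool is a made-up example.

```cpp
#include <iostream>
#include <string>

// Sketch of the prompt layout I am using (Qwen2.5-style ChatML with tools).
// The exact wording should come from the model's own chat template; the
// get_battery_level tool below is just an illustrative example.
static std::string build_prompt(const std::string & tools_json, const std::string & user_msg) {
    return
        "<|im_start|>system\n"
        "You are a helpful assistant.\n\n"
        "# Tools\n\n"
        "You may call one or more functions to assist with the user query.\n"
        "You are provided with function signatures within <tools></tools> XML tags:\n"
        "<tools>\n" + tools_json + "\n</tools>\n\n"
        "For each function call, return a json object with function name and arguments "
        "within <tool_call></tool_call> XML tags:\n"
        "<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>"
        "<|im_end|>\n"
        "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
        "<|im_start|>assistant\n";
}

// Pull the JSON payload out of the model's <tool_call> ... </tool_call> block;
// the app then maps that to a real function.
static std::string extract_tool_call(const std::string & output) {
    const std::string open  = "<tool_call>";
    const std::string close = "</tool_call>";
    const size_t begin = output.find(open);
    if (begin == std::string::npos) return "";   // model answered without a tool call
    const size_t start = begin + open.size();
    const size_t end   = output.find(close, start);
    if (end == std::string::npos) return "";     // truncated output
    return output.substr(start, end - start);
}

int main() {
    const std::string tools =
        R"({"type": "function", "function": {"name": "get_battery_level", "parameters": {}}})";
    std::cout << build_prompt(tools, "How full is my battery?") << "\n";

    const std::string fake_output =
        "<tool_call>\n{\"name\": \"get_battery_level\", \"arguments\": {}}\n</tool_call>";
    std::cout << extract_tool_call(fake_output) << "\n";
}
```

Every function added inside `<tools>` grows this fixed prefix, which is the part I’d like to stop re-processing on every request.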

The Issue

  • Baseline performance: Without tool calling, inference latency is 1–1.5 seconds, which is acceptable.
  • Increased latency with tools: As I add more functions to the system prompt, inference time increases significantly (as expected :sweat_smile:). Right now, with tool calling enabled and multiple functions defined, inference takes around 10 seconds per request.

My Question

Is there a way to persist the tool definitions/system message across multiple inferences? Ideally, I’d like to avoid re-injecting the tool definitions and system prompt on every request to reduce latency.

I’ve been exploring caching mechanisms (KV cache, etc.), but I haven’t had success implementing them in LLamaAndroid. Is this even possible in LLamaAndroid, or can the same effect be achieved some other way?
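For context, the mechanism I think I need to reach is llama.cpp’s state/session API, which can dump the KV cache for an already-processed prefix to a file and restore it later. A minimal sketch of the calls involved, assuming a fairly recent llama.h (older builds name them llama_save_session_file / llama_load_session_file) and assuming the raw llama_context were reachable from the wrapper:

```cpp
#include <vector>
#include "llama.h"

// Rough sketch: persist the KV cache produced by the (system prompt + tool
// definitions) prefix once, then restore it later so the prefix never has to
// be re-evaluated. Function names follow recent llama.h; older llama.cpp
// builds expose the same idea as llama_save_session_file / llama_load_session_file.

// Call once, after the prefix tokens have been run through llama_decode.
void save_prefix_state(llama_context * ctx,
                       const std::vector<llama_token> & prefix_tokens,
                       const char * path) {
    llama_state_save_file(ctx, path, prefix_tokens.data(), prefix_tokens.size());
}

// Call on a later request (or after an app restart) instead of re-decoding the prefix.
bool load_prefix_state(llama_context * ctx,
                       std::vector<llama_token> & prefix_tokens_out,
                       const char * path) {
    prefix_tokens_out.resize(4096);          // capacity for the stored prefix tokens
    size_t n_loaded = 0;
    if (!llama_state_load_file(ctx, path,
                               prefix_tokens_out.data(),
                               prefix_tokens_out.size(),
                               &n_loaded)) {
        return false;                        // no usable session file yet
    }
    prefix_tokens_out.resize(n_loaded);
    return true;                             // the KV cache now holds the prefix
}
```

As far as I can tell, nothing like this is exposed through LLamaAndroid’s Kotlin API.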

Does anyone have suggestions on how to handle this efficiently? I’m kinda stuck :sweat_smile:. Thanks!


I don’t really know LLamaAndroid, but I think there’s a good chance the caching mechanism can’t be implemented without modifying the backend.
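To illustrate what modifying the backend could mean here: llama.cpp’s C API lets you keep one context alive, decode the (system prompt + tool definitions) prefix once, and then trim the KV cache back to that prefix before each new request, so only the new user turn gets processed. A rough sketch, assuming a 2024-era llama.h (newer versions rename the KV-cache functions) and a single sequence:

```cpp
#include "llama.h"

// Sketch of the kind of backend change I mean: keep one llama_context alive
// for the whole app session, decode the static prefix (system prompt + tool
// definitions) exactly once on sequence 0, and before each new request drop
// only the KV-cache entries that come after that prefix.
// llama_kv_cache_seq_rm is the 2024-era name; newer llama.cpp versions rename
// this call, so check the llama.h you are actually building against.

// n_prefix: number of tokens in the (system prompt + tools) prefix that was
// decoded once at startup.
void start_new_request(llama_context * ctx, int n_prefix) {
    // Remove positions [n_prefix, end) of sequence 0; the prefix stays cached.
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/n_prefix, /*p1=*/-1);

    // The new user turn is then tokenized and decoded starting at position
    // n_prefix, so only those few tokens pay the prompt-processing cost.
}
```

Whether this is reachable from LLamaAndroid without touching its JNI/C++ layer, I don’t know.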

Also, there are ways to speed up everything, rather than just the function-calling path…
If you can write C++ or assembly, you could optimize the llama.cpp kernels themselves, which are presumably what’s used internally, for your device’s CPU or GPU…

There is also the chance that someone has already built a faster backend, like the one below. There also seems to be an Android version of Ollama, but I don’t know if it works, and Ollama is usually slower than llama.cpp, so I don’t think it would be a good choice in this case.

Also, simply using a quantization that is easy to accelerate, such as a Q4_0 GGUF, might speed things up a little.


At the moment, rewriting core llama.cpp components is too much. But I will look into the link provided, and I already make use of those quantizations. Thanks for the suggestions.
