How to Reduce Latency When Using Tool Calling in LLamaAndroid?

Hi everyone!

I’m currently working on my thesis, which focuses on running an SLM with function calling on a resource-limited Android device. I have an Android app using LLamaAndroid, which runs a Qwen2.5 0.5B model via llama.cpp with Vulkan, achieving an average speed of 34 tokens per second.

To enable tool calling, I’m using ChatML in the system prompt. This lets me inject the available tools alongside a system prompt that defines the model’s behavior. The SLM then generates a tool call, which I parse in my Android app to determine which function to run.
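For reference, this is roughly what the prompt construction and the parsing step look like. It’s only a sketch: the system-prompt wording should come from Qwen2.5’s own chat template, and the `get_battery_level` tool is a made-up example.

```cpp
#include <iostream>
#include <string>

// Sketch of the prompt layout I am using (Qwen2.5-style ChatML with tools).
// The exact wording should come from the model's own chat template; the
// get_battery_level tool below is just an illustrative example.
static std::string build_prompt(const std::string & tools_json, const std::string & user_msg) {
    return
        "<|im_start|>system\n"
        "You are a helpful assistant.\n\n"
        "# Tools\n\n"
        "You may call one or more functions to assist with the user query.\n"
        "You are provided with function signatures within <tools></tools> XML tags:\n"
        "<tools>\n" + tools_json + "\n</tools>\n\n"
        "For each function call, return a json object with function name and arguments "
        "within <tool_call></tool_call> XML tags:\n"
        "<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>"
        "<|im_end|>\n"
        "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
        "<|im_start|>assistant\n";
}

// Pull the JSON payload out of the model's <tool_call> ... </tool_call> block;
// the app then maps that to a real function.
static std::string extract_tool_call(const std::string & output) {
    const std::string open  = "<tool_call>";
    const std::string close = "</tool_call>";
    const size_t begin = output.find(open);
    if (begin == std::string::npos) return "";   // model answered without a tool call
    const size_t start = begin + open.size();
    const size_t end   = output.find(close, start);
    if (end == std::string::npos) return "";     // truncated output
    return output.substr(start, end - start);
}

int main() {
    const std::string tools =
        R"({"type": "function", "function": {"name": "get_battery_level", "parameters": {}}})";
    std::cout << build_prompt(tools, "How full is my battery?") << "\n";

    const std::string fake_output =
        "<tool_call>\n{\"name\": \"get_battery_level\", \"arguments\": {}}\n</tool_call>";
    std::cout << extract_tool_call(fake_output) << "\n";
}
```

Every function added inside `<tools>` grows this fixed prefix, which is the part I’d like to stop re-processing on every request.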

The Issue

  • Baseline performance: Without tool calling, inference latency is 1–1.5 seconds, which is acceptable.
  • Increased latency with tools: As I add more functions to the system prompt, inference time increases significantly (as expected :sweat_smile:). Right now, with tool calling enabled and multiple functions defined, inference takes around 10 seconds per request.

My Question

Is there a way to persist the tool definitions/system message across multiple inferences? Ideally, I’d like to avoid re-injecting the tool definitions and system prompt on every request to reduce latency.

I’ve been exploring caching mechanisms (KV cache, etc.), but I haven’t had success implementing them in LLamaAndroid. Is this even possible in LLamaAndroid, or can the same effect be achieved some other way?
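For context, the mechanism I think I need to reach is llama.cpp’s state/session API, which can dump the KV cache for an already-processed prefix to a file and restore it later. A minimal sketch of the calls involved, assuming a fairly recent llama.h (older builds name them llama_save_session_file / llama_load_session_file) and assuming the raw llama_context were reachable from the wrapper:

```cpp
#include <vector>
#include "llama.h"

// Rough sketch: persist the KV cache produced by the (system prompt + tool
// definitions) prefix once, then restore it later so the prefix never has to
// be re-evaluated. Function names follow recent llama.h; older llama.cpp
// builds expose the same idea as llama_save_session_file / llama_load_session_file.

// Call once, after the prefix tokens have been run through llama_decode.
void save_prefix_state(llama_context * ctx,
                       const std::vector<llama_token> & prefix_tokens,
                       const char * path) {
    llama_state_save_file(ctx, path, prefix_tokens.data(), prefix_tokens.size());
}

// Call on a later request (or after an app restart) instead of re-decoding the prefix.
bool load_prefix_state(llama_context * ctx,
                       std::vector<llama_token> & prefix_tokens_out,
                       const char * path) {
    prefix_tokens_out.resize(4096);          // capacity for the stored prefix tokens
    size_t n_loaded = 0;
    if (!llama_state_load_file(ctx, path,
                               prefix_tokens_out.data(),
                               prefix_tokens_out.size(),
                               &n_loaded)) {
        return false;                        // no usable session file yet
    }
    prefix_tokens_out.resize(n_loaded);
    return true;                             // the KV cache now holds the prefix
}
```

As far as I can tell, nothing like this is exposed through LLamaAndroid’s Kotlin API.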

Does anyone have suggestions on how to handle this efficiently? I’m kinda stuck :sweat_smile:. Thanks!


I don’t really know LLamaAndroid, but I think there’s a good chance the caching mechanism can’t be implemented without modifying the backend.
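To illustrate what modifying the backend could mean here: llama.cpp’s C API lets you keep one context alive, decode the (system prompt + tool definitions) prefix once, and then trim the KV cache back to that prefix before each new request, so only the new user turn gets processed. A rough sketch, assuming a 2024-era llama.h (newer versions rename the KV-cache functions) and a single sequence:

```cpp
#include "llama.h"

// Sketch of the kind of backend change I mean: keep one llama_context alive
// for the whole app session, decode the static prefix (system prompt + tool
// definitions) exactly once on sequence 0, and before each new request drop
// only the KV-cache entries that come after that prefix.
// llama_kv_cache_seq_rm is the 2024-era name; newer llama.cpp versions rename
// this call, so check the llama.h you are actually building against.

// n_prefix: number of tokens in the (system prompt + tools) prefix that was
// decoded once at startup.
void start_new_request(llama_context * ctx, int n_prefix) {
    // Remove positions [n_prefix, end) of sequence 0; the prefix stays cached.
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/n_prefix, /*p1=*/-1);

    // The new user turn is then tokenized and decoded starting at position
    // n_prefix, so only those few tokens pay the prompt-processing cost.
}
```

Whether this is reachable from LLamaAndroid without touching its JNI/C++ layer, I don’t know.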

Also, there are ways to speed up everything, rather than just the function-calling path…
If you can write C++ or assembly, you could optimize the llama.cpp kernels themselves, which are presumably what’s used internally, for your device’s CPU or GPU…

There is also the chance that someone has already built a faster backend, like the one below. There also seems to be an Android version of Ollama, but I don’t know if it works, and Ollama is usually slower than llama.cpp, so I don’t think it would be a good choice in this case.

Also, simply using a quantization that is easy to accelerate, such as a Q4_0 GGUF, might speed things up a little.


At the moment, rewriting core llama.cpp components is too much. But I will look into the link provided, and I already make use of those quantizations. Thanks for the suggestions.
