Hello, I am a developer from South Korea. My English is not perfect, but I wanted to share some insights and records from my solo journey in developing On-Device & On-Premise AI.
Currently, I share my progress on Reddit and X. However, because my performance metrics (e.g., 0.02s TTFT) often exceed typical expectations, my posts are frequently flagged by automated spam filters. My X account is still active, so I am providing direct links to my posts there, where you can verify the actual demonstration videos and logs for yourself.
AI Orchestration: From Hard-coded Logic to Autonomous Reasoning
Previously, I hard-coded the pre-judgment logic to prioritize speed, which produced large performance gains. At this stage, I have evolved the architecture so that the LLM performs its own judgment (Reasoning) and autonomously executes the subsequent logic.
I am sharing the actual inference logs where the LLM’s self-directed reasoning generates prompts and leads directly to the final output, moving beyond simple command execution.
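To make the idea concrete, here is a minimal sketch of the routing pattern described above. All names here are hypothetical and the LLM call is stubbed; my actual pipeline differs. The point is that the LLM itself emits the mode decision (chat, TTI, or search) as structured output, instead of the orchestrator matching the request against hard-coded rules.

```python
import json

def stub_llm(prompt: str) -> str:
    """Stand-in for a local LLM call; routes on simple keywords for the demo."""
    if "draw" in prompt:
        return '{"mode": "tti", "reason": "user wants an image"}'
    if "latest" in prompt or "news" in prompt:
        return '{"mode": "search", "reason": "needs fresh web data"}'
    return '{"mode": "chat", "reason": "plain conversation"}'

def route(request: str, llm=stub_llm) -> str:
    # Ask the LLM to judge the request and reply with a machine-readable decision.
    prompt = (
        "Decide how to handle the user request. Reply with JSON only: "
        '{"mode": "chat" | "tti" | "search", "reason": "..."}\n'
        f"Request: {request}"
    )
    decision = json.loads(llm(prompt))
    return decision["mode"]

print(route("draw a cat in a spacesuit"))    # tti
print(route("what is the latest LLM news"))  # search
print(route("how are you today"))            # chat
```

The downstream logic (model hot-swap, web search, plain generation) then dispatches on the returned mode.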
Key Performance Metrics (TTFT / Tokens / Speed)
1. Chat Mode (General Conversation)
- Reasoning Time: 0.219s
- TTFT (Time To First Token): 0.1399s
- Inference Speed: 46.25 tok/s (13 tokens / 0.42s)
Note: Achieved near-instant responsiveness with minimal reasoning time.
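For readers checking the arithmetic: 13 tokens over the full 0.42s is about 31 tok/s, while the reported 46.25 tok/s matches decode time with TTFT excluded, which is the common convention. A quick sanity check (my reading of the numbers, not an official formula from the logs):

```python
# Chat Mode figures from the log above.
tokens = 13
elapsed = 0.42   # seconds, total generation time
ttft = 0.1399    # seconds, time to first token

naive = tokens / elapsed                  # ~30.95 tok/s over total time
decode_only = tokens / (elapsed - ttft)   # ~46.4 tok/s, close to the reported 46.25

print(round(naive, 2), round(decode_only, 2))
```

The same convention fits the Search Mode numbers below (670 tokens over 18.03s minus 0.6784s TTFT is ~38.6 tok/s).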
2. TTI Autonomous Transition (Image Generation)
- Reasoning Time: 4.363s (LLM independently judged the need for image generation and decided on the model swap)
- Task: Performed an autonomous Hot-Swap to sd_xl_turbo.
- Result: Successfully created 6 images (Total elapsed time: 14.18s).
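The hot-swap itself boils down to a VRAM-budget eviction step: free the chat LLM before the image model loads. A minimal sketch of that mechanism (class and footprint numbers are illustrative, not my actual implementation):

```python
class ModelManager:
    """Toy VRAM manager: evicts loaded models until a new one fits the budget."""

    def __init__(self, vram_budget_gb: float):
        self.vram_budget_gb = vram_budget_gb
        self.loaded = {}  # model name -> VRAM footprint in GB

    def used(self) -> float:
        return sum(self.loaded.values())

    def load(self, name: str, vram_gb: float):
        # Hot-swap: unload models until the incoming one fits the budget.
        while self.used() + vram_gb > self.vram_budget_gb and self.loaded:
            evicted, _ = self.loaded.popitem()
            print(f"unloading {evicted}")
        self.loaded[name] = vram_gb
        print(f"loaded {name} ({vram_gb} GB)")

mgr = ModelManager(vram_budget_gb=8.0)
mgr.load("chat-llm-7b", 5.5)
mgr.load("sd_xl_turbo", 7.0)  # triggers eviction of the chat model first
print(sorted(mgr.loaded))     # ['sd_xl_turbo']
```

In the autonomous flow, this load call is triggered by the LLM's own judgment rather than a user command.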
3. Search Mode (Web Search & Context Analysis)
- Reasoning Time: 5.809s (Judged search necessity and assembled context)
- TTFT: 0.6784s
- Inference Speed: 38.62 tok/s (670 tokens / 18.03s)
- Scale: Injected 1,527 chars of web context to derive a 2,630-char output.
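The context-injection step can be pictured as packing web snippets into the prompt up to a character budget (like the 1,527 chars in this log). A hedged sketch with a hypothetical helper; the real prompt format is not shown here:

```python
def build_search_prompt(query: str, snippets: list[str], budget_chars: int) -> str:
    """Pack whole snippets into the prompt until the character budget is hit."""
    context, used = [], 0
    for s in snippets:
        if used + len(s) > budget_chars:
            break  # skip snippets that would overflow the context budget
        context.append(s)
        used += len(s)
    return (
        "Answer using the web context below.\n"
        "--- context ---\n" + "\n".join(context) + "\n--- end ---\n"
        f"Question: {query}"
    )

prompt = build_search_prompt(
    "what changed in the latest release?",
    ["snippet one " * 20, "snippet two " * 20, "snippet three " * 200],
    budget_chars=1527,
)
print(len(prompt))  # the oversized third snippet is dropped
```

The packed prompt is then what produces the long-form output (2,630 chars here) in a single generation pass.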
[Autonomous Reasoning] LLM Self-Directed Task Execution (Chat, TTI, and Web Search)
Benchmark: 13-Model Continuous Hot-Swapping
To test real-world UX, I measured performance from a “Cold Start” (loading overhead included).
- Total Swaps: 13 models (135M to 14B)
- Avg. Swap Time: ~5.92s
- Hardware Limit Found: 7B models are the “Sweet Spot” for 8GB VRAM. 14B models hit a hard bottleneck (0.83 tok/s).
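For anyone reproducing a cold-start measurement like this, the timing harness is simple: wrap the full load-from-disk in a monotonic timer. A minimal sketch (my harness differs; the load is simulated with a sleep here):

```python
import time

def timed_swap(load_fn) -> float:
    """Time one cold-start swap: full load, nothing cached."""
    start = time.perf_counter()
    load_fn()
    return time.perf_counter() - start

def fake_load():
    time.sleep(0.05)  # stand-in for loading model weights into VRAM

swaps = [timed_swap(fake_load) for _ in range(3)]
avg = sum(swaps) / len(swaps)
print(f"avg swap time: {avg:.2f}s")
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and high-resolution, which matters when individual swaps are only a few seconds.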