Tempus: A Resource-Invariant GEMM Framework for Versal AI Edge (607 GOPS on 16 cores + open-source C++)

TL;DR: We built a GEMM framework that achieves 607 GOPS on AMD Versal AI Edge using only 16 AIE-ML cores, without scaling hardware resources. The complete C++/HLS code is open-source.

The Problem: Most SOTA GEMM frameworks scale by adding more cores (spatial scaling). This fails on resource-limited edge SoCs due to routing congestion and bandwidth saturation.

Our Solution (Tempus):

  • Temporal scaling instead of spatial: a fixed 16-core compute block is reused over time, so larger problems take more passes rather than more cores.

  • Algorithmic data tiling & replication on Programmable Logic.

  • Deadlock-free DATAFLOW with II=1 cascade streaming.
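For anyone new to the idea, here is a minimal host-side sketch of temporal scaling in plain C++. It is not the Tempus kernel or HLS code: the tile sizes `TM`, `TK`, `TN` and the names `tile_mac` and `gemm_temporal` are hypothetical, chosen only for illustration, and int8 inputs with int32 accumulation are an assumption. The point it shows is that one fixed-size compute routine is called repeatedly over tiles, so a bigger matrix costs more passes, not more hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tile shape: what one fixed compute block handles per pass.
constexpr std::size_t TM = 4, TK = 8, TN = 4;

// One pass of the fixed compute block: c_tile += a_tile * b_tile.
void tile_mac(const std::int8_t* a, const std::int8_t* b, std::int32_t* c) {
    for (std::size_t i = 0; i < TM; ++i) {
        for (std::size_t j = 0; j < TN; ++j) {
            std::int32_t acc = c[i * TN + j];
            for (std::size_t k = 0; k < TK; ++k) {
                acc += std::int32_t(a[i * TK + k]) * std::int32_t(b[k * TN + j]);
            }
            c[i * TN + j] = acc;
        }
    }
}

// Row-major GEMM, C (MxN) = A (MxK) * B (KxN), using only tile_mac.
// Assumes M, K, N are multiples of the tile dimensions.
std::vector<std::int32_t> gemm_temporal(const std::vector<std::int8_t>& A,
                                        const std::vector<std::int8_t>& B,
                                        std::size_t M, std::size_t K, std::size_t N) {
    assert(M % TM == 0 && K % TK == 0 && N % TN == 0);
    assert(A.size() == M * K && B.size() == K * N);

    std::vector<std::int32_t> C(M * N, 0);
    std::int8_t a_tile[TM * TK];
    std::int8_t b_tile[TK * TN];
    std::int32_t c_tile[TM * TN];

    for (std::size_t m0 = 0; m0 < M; m0 += TM) {
        for (std::size_t n0 = 0; n0 < N; n0 += TN) {
            for (std::size_t i = 0; i < TM * TN; ++i) c_tile[i] = 0;

            // Walk the K dimension in time; the same block is reused each step.
            for (std::size_t k0 = 0; k0 < K; k0 += TK) {
                // Tile extraction stands in for the data-tiling stage that feeds the block.
                for (std::size_t i = 0; i < TM; ++i)
                    for (std::size_t k = 0; k < TK; ++k)
                        a_tile[i * TK + k] = A[(m0 + i) * K + (k0 + k)];
                for (std::size_t k = 0; k < TK; ++k)
                    for (std::size_t j = 0; j < TN; ++j)
                        b_tile[k * TN + j] = B[(k0 + k) * N + (n0 + j)];

                tile_mac(a_tile, b_tile, c_tile);
            }

            // Write the finished output tile back into C.
            for (std::size_t i = 0; i < TM; ++i)
                for (std::size_t j = 0; j < TN; ++j)
                    C[(m0 + i) * N + (n0 + j)] = c_tile[i * TN + j];
        }
    }
    return C;
}
```

Calling `gemm_temporal(A, B, M, K, N)` with larger `M`, `K`, or `N` only increases the number of `tile_mac` calls; the tile buffers and the compute routine stay the same size.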

Results (on Versal AI Edge):

  • 607 GOPS at 10.7W total on-chip power.

  • 22x core frugality vs. spatial SOTA (ARIES).

  • 211x higher platform-aware utility (PAU).

  • Zero URAM/DSP utilization.
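Two simple ratios follow directly from the numbers above, and the short C++ sketch below computes them. The helper names `gops_per_watt` and `gops_per_core` are hypothetical, and these ratios are not the PAU metric, which is defined in the paper.

```cpp
#include <cassert>

// Throughput per watt of total on-chip power.
double gops_per_watt(double gops, double watts) {
    assert(watts > 0.0);
    return gops / watts;
}

// Throughput per AIE-ML core.
double gops_per_core(double gops, int cores) {
    assert(cores > 0);
    return gops / cores;
}
```

With the figures posted here, `gops_per_watt(607.0, 10.7)` is roughly 56.7 GOPS/W and `gops_per_core(607.0, 16)` is roughly 37.9 GOPS per core.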

Repository: https://github.com/mgrailoo/TEMPUS
Paper: https://arxiv.org/abs/2605.00536

The repo includes end-to-end flows, from comparison against PyTorch through to hardware deployment. We hope this provides a sustainable foundation for edge LLM inference on Versal.

Happy to answer any questions about the implementation, tiling schemes, or performance metrics!

## Quick Start for anyone who wants to try it:

```bash
git clone https://github.com/mgrailoo/TEMPUS
cd TEMPUS/rectangular_gemm_end_to_end
# Follow the README to configure config.json and run
```