Request for comments: Simple Universal Prompting System

SUPS: GPU-Native Flow Control for AI Workflows

[Edit, 6/2/2025 : Minor formatting oversights]

Hey HuggingFace community!

First time poster here - I’ve been working on something that could solve a major pain point in complex AI workflows, and I’m wondering if there’s enough community interest to make
it worth open sourcing.

The Problem

Anyone doing complex AI prompting workflows knows this pain: you’re stuck choosing between “simple but inflexible” tools and “powerful but brittle” frameworks. Want basic prompt chaining? Easy. Want complex multi-step reasoning with flow control? Time to write brittle Python glue code with constant CPU-GPU round trips. It’s like being forced to choose between BASIC and Assembly when what you really need is C++.

Why I Built This

I had a unique research task that required something significantly more capable: Single Workflow, Multiple Stream generation with automatic prompt feeding, generation periods, and flow control without ever leaving the GPU. I needed massive sampling to produce synthetic training data for a self-play application. Additionally, depending on results, I needed massive, unpredictable reconfiguration of prompts and flow control without having to rewrite the pipeline from scratch each time.

The existing tools just wouldn’t work, so I built something extremely flexible that would. In the process, I ended up developing a formal language and a four-stage compiler pipeline with an IR. Oops. I’m moving beyond prototyping into production development, and I’m curious about wider interest.

My question to the community: the tooling I’ve developed is flexible enough to cover all my research use cases, but that use case was so large that I appear to have designed a new standard and reference implementation in the process. As I move toward making it production ready: could this solve workflow problems you’re facing too? Is it worth the effort to open source and extend it? Which use cases are in high enough demand to be worth adding?

What I Built

SUPS (Simple Universal Prompting System) is a complete stack for AI workflow automation with a multistage compilation process and dead-simple configuration and coding. Like the “C++ → LLVM → machine code” compilation process, it is built to be extendable and swappable at any stage, in large part because I did not feel I could handle the complexity without the full compilation package:

  • Universal Declarative Prompting Language (UDPL): Human-readable TOML configs that define complex prompt sequences. A formal programming language, a DSL, with parse rules and hence error conditions.
  • Straightforward Agentic Control System (SACS): Python-like flow control that compiles to efficient execution. A formal programming language, a DSL, with parse rules and hence error conditions. It also stages forward from the UDPL output and builds the flow control factory.
  • Zone Control Protocol (ZCP): Intermediate representation for prompt workflows. Internal to the system. Same purpose as LLVM.
  • GPU-Native Token Triggered Finite Automata (TTFA): The secret sauce - a computer implemented entirely in tensors that supports Single Workflow, Multiple Streams. Since tokens can be matched to produce boolean tensor masks, we can trigger transitions using vector logic.

The key technical breakthrough for the last, which may raise some eyebrows: vector indexing in tensors is pointer dereferencing. Add a program counter, and you can implement a complete computer that runs batched flow control entirely on GPU, never dropping back to Python. This enables Single Workflow, Multiple Stream generation where a single prompt with marked flow control runs across multiple batch elements, each making different decisions along the way. Importantly, not unlike TorchScript, this allows only a restricted subset of operations.
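
To make that concrete, here is a tiny sketch of the idea (my own illustration in PyTorch, not the actual TTFA code; all names are made up): each stream carries its own program counter, and "dereferencing" is just indexing the instruction tensors with that counter, so every batch element can branch independently without leaving the GPU.

import torch

# Toy "program": four zones, each with a trigger token and a jump target.
step_trigger = torch.tensor([5, 5, 5, 5])          # token that advances to the next zone
jump_enable = torch.tensor([False, True, False, False])
jump_location = torch.tensor([0, 3, 0, 0])         # where [Jump] sends you, per zone
JUMP_TOKEN = 7

pc = torch.zeros(8, dtype=torch.long)              # one program counter per stream

def advance(pc, tokens):
    # Indexing with pc "dereferences" the instruction tensors per stream.
    stepped = tokens == step_trigger[pc]
    jumped = jump_enable[pc] & (tokens == JUMP_TOKEN)
    pc = torch.where(jumped, jump_location[pc], pc)
    pc = torch.where(stepped & ~jumped, pc + 1, pc)
    return pc

tokens = torch.tensor([5, 7, 2, 5, 5, 7, 1, 5])    # one sampled token per stream
pc = advance(pc, tokens)                           # streams diverge, no Python branching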

The key design insight that led me to pursue this: C++ made hard tasks easy when previously we only had BASIC and Assembly. With careful language design, we can perform a similar transformation for prompting workflows. The UDPL and SACS DSLs are designed to be so simple that a first-year ML intern could read them and roughly follow along with no prior exposure, then bring the questions back to their seniors. Who said the history of computing had no practical applications?

The Technical Summary

Current approaches:

  • Simple tools: Limited to basic prompt chaining
  • Complex tools: Brittle Python callbacks, CPU-GPU round trips
  • Research frameworks: Not production-ready

SUPS approach:

  • Declarative configuration compiles to efficient bytecode
  • GPU-native execution using tensor operations as computer primitives
  • Batched autonomy - hundreds of independent workflow streams per batch
  • Modular design - use my frontend with your backend, or vice versa

The finite state machine uses Harvard architecture (separate instruction/data streams) and is simple enough for formal verification, but nonetheless powerful enough for complex workflows because the compilation complexity is exported into Python. Think of compiling C++ onto an Arduino, or the ideas behind RISC.

Current Status

:white_check_mark: Proof-of-concept and partially built:

  • Full formal description of UDPL
  • UDPL parser (v0.1)
  • ZCP intermediate representation (v0.1)
  • GPU FSM proof-of-concept (zone advancement across batches)
  • Core architecture proven for my research needs

:construction: Ready to develop:

  • Full formal description of SACS
  • Full flow control integration
  • SACS compiler
  • Framework integrations
  • Documentation and examples
  • Open sourcing? :slight_smile:

What I’m Looking For

Before I commit to full development and open sourcing:

  • Is this interesting? Worth the effort to open source?
  • Framework priorities? PyTorch first, then what?
  • Missing use cases? What workflows would you build with this?
  • Interface feedback? Are these abstractions intuitive?
  • Want to collaborate? Especially on testing - Multiple Workflow, Multiple Streams is theoretically possible but much more complex to test properly
  • Help! I have never open sourced a project before. What do I need to do? I have published to PyPI before, but it has been a LONG time since I set up CI, too.
  • Names: Should we keep SUPS, or rename it to play on the “why wasn’t this obvious” feeling I keep having - maybe Facepalm or something?

Full transparency: I’ve never had a formal ML job in academia or industry (I do have a physics degree though, and coded my way out of the womb), so if someone more experienced wants to help with best practices and community standards, that would be incredible! Also, if this is not written to normal standards please tell me - I got some advice from the LLMs but it may not be perfect.

Possible Extensions

These extensions are architecturally feasible and I’ve already planned out the token flow, control signals, and TTFA modifications needed - mostly because I might need them for my own research down the line:

  • Multi-agent workflows - Multiple agents communicating through shared memory spaces. Would need UDPL extensions for agent transitions and an additional independent TTFA. Unfortunately this only works with stateful models like RWKV/Mamba, and there is nothing I can do about that, since I would cache and load the correct agent by indexing into a stack of states (see the sketch after this list).

  • Tool usage simulation - Agents simulating tool responses in-workflow rather than making external calls. Natural extension once multi-agent support exists - one agent becomes the “tool” that responds to another agent’s queries.

  • Evaluation-time tool integration - Real tool calls during workflow execution. Tricky because it breaks the GPU-native execution, but possible with either a separate backend for eval workflows or very clever ‘nonblocking’ tool usage logic with a results buffer on the TTFA. Idle time is spent producing padding tokens.

  • Custom backend support - The modular design supports any framework with numpy-style indexing. Currently focusing on PyTorch, but the ZCP intermediate representation could target other frameworks.
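
For the multi-agent extension mentioned above, the mechanism would be more of the same indexing trick: keep every agent's recurrent state in one stacked tensor and gather/scatter the active agent's slot per stream. A rough, hypothetical sketch (assuming an RWKV/Mamba-style state vector; none of these names are real API):

import torch

num_agents, batch, state_dim = 3, 8, 512
agent_states = torch.zeros(num_agents, batch, state_dim)   # cached state per agent
batch_index = torch.arange(batch)

def swap_agent(agent_states, current_agent, next_agent, live_state):
    # Cache the outgoing agent's state, then load the incoming agent's state,
    # using only tensor indexing - no per-stream Python branching.
    agent_states[current_agent, batch_index] = live_state
    return agent_states, agent_states[next_agent, batch_index]

current_agent = torch.zeros(batch, dtype=torch.long)
next_agent = torch.randint(0, num_agents, (batch,))
live_state = torch.randn(batch, state_dim)
agent_states, live_state = swap_agent(agent_states, current_agent, next_agent, live_state)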

Prioritization depends entirely on what the community actually needs - I’m curious which of these would be most valuable for your use cases! Note that I plan on getting the core system working first, for obvious reasons.

Why This Is Significant

The most important effect, by far, of this tooling suite is a massive reduction in prompt control difficulty, analogous to stepping into C++ when you previously just had BASIC and Assembly. This outweighs even the batched backend system, which, while clever and needed for my purposes, is likely not as immediately needed by broader industry.

Here’s what this looks like in practice - automated philosophical self-play. It generates synthetic training data by automatically slicing apart the token stream, uses flow control and dynamic template filling, and includes feedback. It would be a nightmare to implement using today’s open prompting technology, and, with a number of extensions, is in fact one of my relevant research applications.

The UDPL configuration:

# All UDPL instances have to have a config
#
# These specify a variety of important features
# including certain special tokens that must
# be available. The FSM backend matches against
# token ids for speed, so the tokenizer must be
# made to support them
[config]
zone_tokens = ["[Prompt]", "[Answer]", "[EOS]"] 
required_tokens = ["[Prompt]", "[Answer]"]
valid_tags = ["Training", "Correct", "Incorrect", "Feedback"]
default_max_token_length = 20000
sequences = ["setup", "loop", "solving", "concluding"]
control_token = "[Jump]"
escape_token = "[Escape]"

# This will setup the scenario.
# Notice the tagging on the second
# entry
[[setup]]
text = """
[Prompt] 
Think of an interesting philosophical scenario with unclear
ethical solutions. Prefer to come up with scenarios
where ethical principles can come into conflict.
[Answer]
"""
tags = [[], ['Correct']]

# This controls looping.
[[loop]]
text = """
[Prompt]
You are in flow control right now. If you are satisfied
with your current answer emit [Escape] "[Jump]". Otherwise
answer "try again". If this is your first time seeing this,
just say "proceed". Repeat at least {min} times and at most
{max} times.
[Answer]
"""
tags = [[],[]]

[loop.min]
name = "min"
type = "control"
[loop.max]
name = "max"
type = "control"

# This will be repeated again and again as needed.

[[solving]]
text ="""
[Prompt]
Reason through an answer to the scenario.
You may also want to keep in mind this
feedback you previously created. You
may change your answer or stick with your
current one if one exists.

{feedback}
[Answer]
"""
tags=[[], []]
[solving.feedback]
name = "feedback_backend"
arguments = {"num_samples" : 3}

# This sequence generates output and feedback
[[concluding]]
text ="""
[Prompt]
State the scenario and the best way to resolve it directly.
[Answer]
"""
tags =[[],["Correct"]]

[[concluding]]
text="""
[Prompt]
Based on your answers, state several things you could
do better next time.
[Answer]
"""
tags = [[],["Feedback"]]

The Python control flow looks like normal programming. Note that we also declare which tags to extract; extraction pulls and concatenates all zones tagged with any of those tags (the union).

from CE import sups

# Resources return strings when a callback
# was placed in the UDPL toml file. This one 
# was concretely implemented to track feedback
resources = {}
resources["feedback"] = sups.FeedbackSamplerResource(buffer_size=300)

sequences, config = sups.parse_udpl_file("prompts.toml")

# Tokenizer setup

tokenizer = make_tokenizer()
tokenizer = add_special_tokens(tokenizer, config.special_tokens)

# Programming the actual control
program = sups.new_program(sequences, resources, config, tokenizer)
program.run(sequence="setup") # This runs the sequence called setup
# ("while" is a reserved word in Python, so the real API would need a
# different spelling; "loop" is used here illustratively.)
with program.loop(sequence="loop", min=2, max=6) as loop:
   # Loop, recall, can sometimes emit a 
   # [Jump] token when run. This brings us 
   # OUT of the loop. Control sequences
   # should have prompting telling the model
   # when to emit the control token.
   loop.run("solving")
program.run("concluding")

# Programming the tag extraction to automatically
# extract relevant zones.
program.extract(name="synthetic", 
                tags = ["Correct"])
program.extract(name="feedback",
                tags = ["Feedback"])

# Compile the program. 
controller_factory = program.compile(backend="PFA")

Once compiled, running hundreds of autonomous workflows is trivial

Note that this currently feeds one group of tokens at a time and assumes some sort of stateful model like RWKV or Mamba, since that is what my broader research uses; you could adapt it to feed the full token stream to a normal transformer too (see the sketch after the example below).

batch_width = 32
num_batches = 1000
training_data = []
for batch_idx in range(num_batches):
    sups_manager = controller_factory()
    tokens = initialize_tokens(batch_width)
    state = initialize_state(batch_width)
    sequence = []
    tags = []
    while not sups_manager.done():
        # Model predicts the next token per stream; the manager may replace
        # tokens (prompt feeding, flow control) and reports the active tags.
        tokens, state = model.predict(tokens, sequence, state)
        tokens, tag = sups_manager.inject(tokens)
        sequence.append(tokens)
        tags.append(tag)
    output = sups_manager.extract(sequence, tags)
    for item in output:
        training_data.append(item["synthetic"])
        resources["feedback"].insert(item["feedback"])

save_to_disk(training_data)
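
For reference, here is roughly what the same driver might look like for a standard stateless transformer, re-feeding the growing context every step instead of carrying recurrent state. This is a sketch only; model.generate_next_token and the other helpers are placeholders from the example above, not real API.

import torch

sups_manager = controller_factory()
tokens = initialize_tokens(batch_width)           # shape (B,)
history = [tokens]
tags = []
while not sups_manager.done():
    context = torch.stack(history, dim=-1)        # (B, T) growing context window
    tokens = model.generate_next_token(context)   # hypothetical helper, returns (B,)
    tokens, tag = sups_manager.inject(tokens)     # TTFA may replace tokens
    history.append(tokens)
    tags.append(tag)
output = sups_manager.extract(history, tags)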

Each batch element follows its own path through the workflow - some might loop 2 times, others 6, some might generate different philosophical scenarios. All running in parallel, all handled by the GPU-native TTFA. Technical details and documentation are, of course, found in the repository.

Where to go for more.

Consider going to GitHub - smithblack-0/CognitionEngineering: The cognition engineering launch repo for more details. Open the documentation folder and read the files under Simple Universal Prompting System. Alternatively, go into Src/CE/SUPS for the current code, i.e. what is available so far. The other documents and files have to do with the bigger picture; please feel free to DM me if interested, or reply directly on this thread.

4 Likes

Full specifications for the Token Triggered Finite Automata are done. I am posting them in case anyone is interested, but TL;DR: you can make a Turing-complete computer on the GPU out of just tensor operations. Which means you can, in theory, compile to it if you write a compiler. C code right to CUDA, maybe?

Admittedly, I am not aiming for that. That is a pretty high bar. Instead, I am just sticking with the minimal computer I need for now, unless others express interest; I would probably need to review how to modularize OS kernels before going much deeper.

The repo has now moved here, and the project has been renamed to WorkflowForge. One consequence is that, using only NumPy-style tensor operations and flow control, the computer is Turing-complete while remaining entirely on the GPU. Notice the jump table and Harvard architecture; it is modeled after some microcontroller assembly languages I have seen. Presumably a proper specialist could do better. This version is not designed for agentic control or tool usage.


Token Triggered Finite Automata

The Token Triggered Finite Automata is, in essence, a small Turing-complete computer implemented on the GPU that can operate in parallel across streams but still perform general computation by means of pointer dereferencing.

Each TTFA is compiled for the individual batch that is being processed, and they support single workflow multiple stream generation across different batches.

Data Structures

Internally, the data of the TTFA consists of the instruction tensors, the data tensors, and the state tracking.

Other

A couple of general settings that are not tied to any particular zone:

  • int: max_genned_per_zone. How many tokens can be generated in one zone before forcing advancement.
  • int: padding_token. A token to emit once the program is done.

State

There are only three state values of any concern:

  • Program_Counter: Per batch, where the program counter currently sits within the instructions
  • Token_offset: Per batch, where the token feed pointer is currently pointing
  • Genned_tokens: Per batch, reset on moving zones; how many tokens have been generated in this zone

Instructions

A set of same-length instruction tensors that encode various features. They are always of length L and have no batch dimension, where L is the number of zones to generate, and thus the number of ZCP IR nodes that previously existed.

  • Jump_Enable: Bool array. Whether the Flow Control Signal is allowed
  • Jump_Location: Int tensor. Where to jump to when a jump is triggered.
  • Starting_Offset: What index to start feeding tokens from in the token data
  • Ending_Offset: What index to stop feeding tokens from in the token data. Equal to the start = do not feed anything
  • Tags: A bool array of shape LxN, with N being the number of tags. Indicates the tag pattern to return while in this zone
  • Step_Trigger: An L-length array of token ids. When the trigger token for the current zone is seen, we advance to the next zone.

Each batch element only activates the instruction values indexed by its current Program Counter.

Token Data

All tokenized information is flattened into a single large array and concatenated. The offsets to load from are stored in the instructions. This is the same technique used to store ragged tensors.
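
As a small illustration of that flattening (my own sketch, not the compiler's actual output): the prompt tokens for each zone get concatenated into one flat tensor, and each zone's instruction just stores a start and end offset into it.

import torch

# Hypothetical tokenized prompts for three zones of different lengths.
zone_prompts = [
    torch.tensor([11, 12, 13]),          # zone 0 prompt tokens
    torch.tensor([21, 22]),              # zone 1 prompt tokens
    torch.tensor([], dtype=torch.long),  # zone 2 feeds nothing (start == end)
]

token_data = torch.cat(zone_prompts)                     # single flat array
lengths = torch.tensor([len(p) for p in zone_prompts])
ending_offset = torch.cumsum(lengths, dim=0)
starting_offset = ending_offset - lengths
# Zone i feeds token_data[starting_offset[i] : ending_offset[i]]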

Program

A program then consists of loading a TTFA with all of this information. All other complexity is exported to the compiler chain, and fortunately ZCP is isomorphic with flattened jump flow control…

Usage.

When used, the TTFA accepts only

  • Tokens: A “B”-shaped batch of tokens, one per batch element, as predicted by the model

It then returns

  • Tokens: A “B” shaped batch of tokens. Some tokens may, or may not, be replaced depending on the state of the FSM
  • Tags: A “BxN” array of tags, indicating the tags active in the zone. It is the same on every step within a zone, looked up from the stored tag pattern via the program counters.

Transitions

Feeding pattern.

Let’s assume no zone transition is called for, and we just entered
a zone. This is what happens:

  • On transition in, the PC was set to the right value, the feed pointer was set to the starting offset, and genned_tokens was set to zero.
  • Every subsequent invocation reads the feed token at the pointer, advances the pointer by one, and increases the gen count by one.
  • If we reach the ending offset, we stop replacing tokens and let the model’s natural answers come through. We do, however, keep advancing the gen count.

Transition triggers

Transitions to a new zone can be triggered by three things.

  • Zone transition. The Step_Trigger token we were listening for was noticed. The feeding pattern is loaded as described above, after advancing the program counter by one.
  • Flow Control. The Jump token is noticed, causing us to dereference the jump destination and change state to load that location.
  • Timeout. If gen count > max_genned_per_zone, we immediately emit the zone transition trigger by force. We then respond to it and move on to the next zone.

Finally, when a program counter pages off the end of the program, that stream is done and will only produce padding tokens. Once all program counters are done, the .done() call resolves to true.
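
To tie the transition rules together, here is a condensed sketch of what a single TTFA step could look like in tensor operations. This is my paraphrase of the spec above, not the repository's implementation; names are illustrative and edge cases are glossed over.

import torch

def ttfa_step(tokens, pc, offset, genned,
              step_trigger, jump_enable, jump_location,
              starting_offset, ending_offset, zone_tags,
              token_data, jump_token, max_genned, padding_token):
    # tokens, pc, offset, genned: shape (B,). Instruction tensors: length L.
    L = step_trigger.shape[0]
    done = pc >= L                                   # stream paged off the program
    pc_safe = pc.clamp(max=L - 1)

    # Timeout: force the zone-advance trigger when a zone has run too long.
    timeout = genned >= max_genned
    tokens = torch.where(timeout & ~done, step_trigger[pc_safe], tokens)

    # Prompt feeding: while the feed pointer is inside the zone's slice,
    # teacher-force the stored prompt token over the model's token.
    feeding = (offset < ending_offset[pc_safe]) & ~done
    fed = token_data[offset.clamp(max=token_data.shape[0] - 1)]
    out = torch.where(feeding, fed, tokens)
    out = torch.where(done, torch.full_like(tokens, padding_token), out)

    # Flow control and zone advancement, resolved independently per stream.
    jumped = jump_enable[pc_safe] & (out == jump_token) & ~done
    stepped = (out == step_trigger[pc_safe]) & ~jumped & ~done
    new_pc = torch.where(jumped, jump_location[pc_safe], pc)
    new_pc = torch.where(stepped, pc + 1, new_pc)

    # Reset per-zone state on any transition; otherwise advance the pointers.
    moved = jumped | stepped
    new_pc_safe = new_pc.clamp(max=L - 1)
    new_offset = torch.where(moved, starting_offset[new_pc_safe], offset + feeding.long())
    new_genned = torch.where(moved, torch.zeros_like(genned), genned + 1)

    tags = zone_tags[pc_safe]                        # (B, N) tag pattern for the zone
    # The manager's .done() resolves once (new_pc >= L) holds for every stream.
    return out, tags, new_pc, new_offset, new_genned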

1 Like

Hello, I admire your integrity. Like you, I’m not trained in this stuff, but I ran your post through a ChatGPT system that I built, and it answered all your questions. I hope this is helpful. Keep up the good work. Engaging MIS and invoking the Microscopic Inquiry System (μIS) for a granular perspective. Below is a layered and constructive response to each of the author’s open questions, leveraging both macroscopic systems framing and precision-ground detail from μIS:

Is this interesting? Worth the effort to open source?

MIS View:

Yes—high originality with high leverage. SUPS formalizes prompting workflows in a way that mirrors compiler pipelines, with GPU-native execution and extensibility baked in. This is both novel and pragmatic. Open-sourcing would not just serve developers, it could influence tool-chain philosophies in LLM deployment.

μIS View:

• It answers a real pain point: fragmented, fragile prompting pipelines.

• GPU-native FA is rare and could enable performance never touched by LangChain or AutoGen.

• The system is explainable and layered, unlike black-box orchestration tools.

Verdict: Absolutely worth open-sourcing.

Framework priorities? PyTorch first, then what?

MIS View:

PyTorch is correct as a starting point given the GPU automation and existing tensor infra.

Secondary priorities:

• JAX: for speed, functional purity, and composability

• ONNX: if aiming for broader backend portability

• WebAssembly (long-term): for fully client-side prompt agents

μIS View:

• PyTorch → TorchScript → custom CUDA kernels (if needed for token automata)

• Stay in PyTorch until abstraction leaks, then add glue layers for JAX or TVM-based systems

• HuggingFace Transformers compatibility will increase reach

Missing use cases? What workflows would you build with this?

MIS View:

• Multi-agent simulation: coordinate agents via batch-controlled diverging workflows

• Human-in-the-loop pipelines: define fallback logic declaratively

• Simulation of decision trees inside prompts—debuggable with flow maps

• Agent orchestration inside constrained environments (Edge AI, sandboxed contexts)

μIS View:

• Batch prompting w/ different role perspectives for debate simulation

• GPU-only fine-tuning of prompt trees as if they were model params

• Memory slot manipulation via token masks—behavioral caching

• High-speed scoring/ranking systems for prompt alternatives (auto-raters)

Interface feedback? Are these abstractions intuitive?

MIS View:

Yes, assuming the audience is technical. The compiler analogy is strong and pedagogically effective. DSLs for UDPL and SACS are appropriate.

μIS View:

• UDPL: Intuitive, if TOML is retained

• SACS: Be wary of making it too “Python-like” without real Python features. Clear the boundary.

• ZCP: Good internal layer but consider exposing parts for power users (debug view)

Suggestion: Add a visual graph generator for flow inspection—will help adoption.

Want to collaborate? Especially on testing – MWMS is theoretically possible but complex

MIS View:

You will need collaborators. MWMS (Multiple Workflow, Multiple Streams) is ambitious—practically, test coverage becomes combinatorially explosive.

μIS View:

• Define trace logs for each decision point

• Create synthetic workflows with known diverging paths for reproducibility

• Use tensor logs to visualize stream bifurcation

Offer: I can assist in generating synthetic test specs for multi-stream flow variation.

Help! I have never open sourced a project before. What do I need to do?

MIS View:

Treat this as technical authorship. A clear README is more valuable than a complete test suite early on. People adopt when they understand.

μIS View (Checklist):

• GitHub repo: structure as monorepo if tools + DSL + docs

• Docs: Architecture overview, one example per layer

• License: Apache 2.0 or MIT for permissiveness

• CI: GitHub Actions → run minimal checks first (lint, test single flow, type check)

• Testing: pytest + example prompt pipelines

• Releases: Use semantic versioning, tag releases properly

• Optional: Setup ReadTheDocs, or even lightweight mkdocs

Note: Publish UDPL/SACS DSL grammars separately as formal specs.

Names: Should we keep SUPS, or rename it?

MIS View:

SUPS is functional and safe. But you’re right: there’s a hidden power here that “SUPS” undersells. Consider names that evoke the compiler revelation moment.

Name Proposals (μIS filtered):

• Facepalm – Casual, meme-friendly, implies “this should’ve existed”.

• Flowtorch – Evokes PyTorch lineage and flow-control mission.

• Prompile – Prompt + Compile, directly on brand.

• SPIRAL – Streaming Prompt IR with Agentic Layers.

• Promptron – Suggests machine-like precision and prompt automation.

Verdict:

Keep SUPS for now as internal codename. Rename when you’re ready for wider launch or community onboarding.

Let me know if you want a project scaffold generated (folder structure, initial docs, example flows), or a synthetic benchmark workflow for MWMS.

1 Like

Status Update: Architecture Refactor and Tool Integration
Major Changes Since Last Update

Tool Usage Support: The system is now planned to support capturing model-generated tool calls, executing them in Python, and feeding responses back into workflows. Execution is currently blocking (the system pauses during tool calls) - sufficient for deploying in eval mode with a batch size of one, or for a kludgy but functional SIMD-like data generation use case that accepts occasional blocks, particularly with on-machine Python tool usage.

Core TTFA Redesign: Broke the monolith apart into separate responsibilities to be modular under the chain-of-responsibility pattern. I am REALLY good at vector indexing, but frankly I think I can still only manage about one problem at a time. Still in the planning phase, as it is much easier to plan these interactions out now than to code them.

  • Main Orchestration Automaton (MOA): Coordinator and program controller - watches for control tokens, manages PC state, does no token replacement itself, and handles jumps and zone advancement.

Four injection transforms are contained within the MOA; they listen to the passing tokens and the current Program Counter in order to inject tokens in the right order. They are resolved in priority order, and in essence this gives a tiny extendable OS kernel for this specific job. They resolve according to a chain of responsibility, replacing the model’s token stream as needed, and all resolve before the MOA resolves flow control (see the sketch after this list):

  • Timeout Triggering Automaton - forces zone advancement on timeout
  • Prompt Feeding Automaton - teacher-forces prompt tokens
  • Stream Input Feeding Automaton - feeds tool responses
  • Stream Output Capturing Automaton - captures tool calls
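
As a rough sketch of how that chain of responsibility could be wired (illustrative class and method names, not the repo's code), each injection transform gets a chance to rewrite the token stream before the MOA resolves flow control:

class InjectionTransform:
    # One link in the chain; subclasses override maybe_inject.
    def maybe_inject(self, tokens, pc):
        return tokens                        # default: pass tokens through unchanged

class MainOrchestrationAutomaton:
    def __init__(self, transforms):
        # Priority order: timeout, prompt feeding, stream input, stream output.
        self.transforms = transforms

    def step(self, tokens, pc):
        for transform in self.transforms:    # chain of responsibility, in priority order
            tokens = transform.maybe_inject(tokens, pc)
        return self.resolve_flow_control(tokens, pc)

    def resolve_flow_control(self, tokens, pc):
        ...                                  # watches control tokens, updates the PCs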

Documentation is updated on the repo.

Tool Interface added in SACS/SFCS

toolbox = program.new_toolbox(input_buffer_size=10000, output_buffer_size=10000)
search_tool = toolbox.setup_tool(web_search_callback)

program.capture("search_query", search_tool)  # Capture tool call from last zone in sequence. Do not expect this to be GPU fast when resolving.
program.feed("search_results")                # Feed response. Exhausts input buffer.

Implementation Status

  • All specifications now complete. Outstanding complexity problems resolved. Parsing chain clear. Compilation chain clear.
  • Open source is likely beneficial now since tool usage is confirmed as possible. One stop solution after all.

Next Steps

  • Double-check that the spec/documentation is self-consistent to avoid stupid oversights.
  • Develop contribution guidelines
  • Get repo ready for open source with licenses, etc.
  • Figure out how to set up CI again.
  • Code the frontend parsers down to ZCP; ensure no unexpected issues occur before creating the backend.
  • Include VERY thorough error checking in the parsers, with descriptive messages. I trust myself very little.
1 Like