From Serial to Parallel: Multi-K and the Neural Tree – A Thought Experiment on Rethinking AI Architecture
Preface
This document records a thought experiment aimed at reimagining the fundamental architecture of AI — starting from a simple question:
“Can a Transformer turn the K value into a parallel structure?”
This question led to a series of structural breakthroughs that challenge mainstream assumptions in current AI design. These ideas are not just technical optimizations, but a deeper reconsideration of the nature of intelligence itself:
from serial to parallel, from static to dynamic, from parameter-heavy to structurally grounded.
They may represent an alternate path toward AGI — one that more closely reflects how human cognition actually works.
Chapter 1: The Paradigm Shift from Serial to Parallel
1.1 The Structural Limitations of Transformers
Since its introduction in 2017, the Transformer has become the foundation of modern NLP. However, it has a fundamental limitation: it extracts linguistic structure serially, one stacked layer at a time, rather than in parallel.
The key question we pose is:
“Can a Transformer reframe K as a parallel structure?”
The core insight is:
Language is inherently multi-dimensional (subject, predicate, modifier, etc.), yet the Transformer extracts these dimensions sequentially through stacked layers — a mismatch between structure and function.
1.2 Multi-K: From Scalar K to Parallelized Semantic Axes
In standard Transformers, attention is computed in a single Q-K-V stream:
Q = X·W^Q
K = X·W^K
V = X·W^V
Attention(Q, K, V) = softmax(Q·K^T / √d) · V
The core idea behind Multi-K is to decompose K into a set of parallel vector channels:
Q_base = X·W^Q
multi_K = [X·W^K₁, X·W^K₂, ..., X·W^K₅]
multi_V = [X·W^V₁, X·W^V₂, ..., X·W^V₅]
attn_outputs = [Attention(Q_base, Ki, Vi) for Ki, Vi in zip(multi_K, multi_V)]
output = SRPF_weighted_sum(attn_outputs)
“By rotating the K dimension from stacked to spread, we allow attention to flow across parallel semantic channels.”
This shift, while seemingly simple, changes the way a model processes language — not as linear accumulation, but as simultaneous field interpretation.
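Below is a minimal PyTorch sketch of this decomposition, written only as an illustration: it shares one Q stream across five parallel K/V channels and averages the channel outputs where SRPF_weighted_sum would sit (SRPF itself is introduced in Chapter 2). The class name MultiKAttention, the channel count, and the uniform average are assumptions, not a fixed design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiKAttention(nn.Module):
    """One shared Q stream attending over several parallel K/V channels."""
    def __init__(self, d_model, num_channels=5):
        super().__init__()
        self.d_model = d_model
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        # One K and one V projection per semantic channel (subj, pred, attr, ...)
        self.W_Ks = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(num_channels)])
        self.W_Vs = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(num_channels)])

    def forward(self, x):                          # x: (batch, seq, d_model)
        q = self.W_Q(x)
        outputs = []
        for W_K, W_V in zip(self.W_Ks, self.W_Vs):
            k, v = W_K(x), W_V(x)
            scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5
            outputs.append(F.softmax(scores, dim=-1) @ v)
        # Placeholder for SRPF_weighted_sum: a plain average over the channels.
        return torch.stack(outputs).mean(dim=0)

x = torch.randn(2, 7, 64)
print(MultiKAttention(d_model=64)(x).shape)        # torch.Size([2, 7, 64])

Each channel sees the same tokens but learns its own projection, which is what lets the later self-differentiation step push the channels toward distinct semantic roles.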
1.3 The Multidimensional Nature of K
Each K channel in Multi-K corresponds to a fundamental dimension of language:
- K_subj — Subject/agent identification
- K_pred — Predicate/action core
- K_attr — Attribute and scope modulation
- K_modal — Modal and pragmatic force
- K_context — Temporal and dialogic curvature
“Different words already operate in different functional dimensions. Multi-K restores that structure by giving each K its own channel — effectively, its own route.”
Chapter 2: Self-Differentiation and Dynamic Parameter Flow
2.1 Self-Differentiation in Training
To encourage information to naturally “flow” into the appropriate K-dimension, we introduce a self-differentiation mechanism during training:
# Initialize the K projections with similar weights, one per semantic axis
W_K_list = [W_K_subj, W_K_pred, W_K_attr, W_K_modal, W_K_context]
# Orthogonalization loss: penalize pairwise similarity (minimizing |cos| pushes the projections apart)
orth_loss = sum(abs(cos_sim(W_Ki, W_Kj)) for W_Ki, W_Kj in combinations(W_K_list, 2))
# Dynamic λ scaling: ramp the orthogonality pressure up over training
λ = min_λ + (max_λ - min_λ) * (current_step / total_steps)
# Total loss: the task objective plus the growing orthogonality pressure
total_loss = task_loss + λ * orth_loss
This encourages each K-dimension to specialize, developing into expert pathways that focus on particular kinds of semantic representation.
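As a self-contained version of the sketch above, the loss can be written as follows, assuming the K projections are plain weight matrices and that similarity is measured over their flattened entries; the λ bounds and schedule values are placeholders.

import torch
import torch.nn.functional as F
from itertools import combinations

def orthogonality_loss(W_K_list):
    """Penalize pairwise similarity between the K projection matrices."""
    return sum(torch.abs(F.cosine_similarity(Wi.flatten(), Wj.flatten(), dim=0))
               for Wi, Wj in combinations(W_K_list, 2))

def lambda_schedule(step, total_steps, min_lam=0.01, max_lam=0.1):
    """Linearly ramp the orthogonality pressure over training."""
    return min_lam + (max_lam - min_lam) * (step / total_steps)

# Five K projection matrices standing in for W_K_subj ... W_K_context.
W_K_list = [torch.randn(64, 64, requires_grad=True) for _ in range(5)]
task_loss = torch.tensor(1.0)                      # stand-in for the real task loss
lam = lambda_schedule(step=100, total_steps=10_000)
total_loss = task_loss + lam * orthogonality_loss(W_K_list)
total_loss.backward()                              # gradients flow into each W_K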
2.2 SRPF: Selective Real-Time Parameter Flow
SRPF enables dynamic re-weighting of these dimensions during inference, allowing the model to adaptively route activation through the appropriate branches.
“If AGI is to emerge, we must first liberate the parameter matrix from static LLM structure.”
SRPF offers that liberation — allowing fluidity, routing, and structural reconfiguration on the fly.
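The text does not pin down SRPF's mechanics, so the sketch below is only one plausible reading: a small gating network scores each K channel per token and re-weights the channel outputs at inference time. SRPFGate and its internals are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SRPFGate(nn.Module):
    """Re-weight the per-channel attention outputs token by token."""
    def __init__(self, d_model, num_channels=5):
        super().__init__()
        self.scorer = nn.Linear(d_model, num_channels)

    def forward(self, x, channel_outputs):
        # x: (batch, seq, d_model); channel_outputs: list of (batch, seq, d_model)
        weights = F.softmax(self.scorer(x), dim=-1)          # (batch, seq, channels)
        stacked = torch.stack(channel_outputs, dim=-1)       # (batch, seq, d_model, channels)
        # Route activation: each token draws mostly on the channels the gate favours.
        return (stacked * weights.unsqueeze(-2)).sum(dim=-1)

x = torch.randn(2, 7, 64)
outs = [torch.randn(2, 7, 64) for _ in range(5)]
print(SRPFGate(64)(x, outs).shape)                           # torch.Size([2, 7, 64])

A gate like this could also stand in for SRPF_weighted_sum in the Multi-K sketch from Chapter 1.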
Chapter 3: Neural Trees — A Dynamically Growing Cognitive Structure
3.1 From Multi-K to Neural Trees
The Neural Tree is a natural extension of Multi-K — a structure that:
- Is tree-shaped, not layered
- Grows and prunes dynamically
- Self-organizes by dimension
- Supports adaptive routing and temporal modulation
“Instead of a fixed layered stack, we form trees of meaning — like how neural pathways emerge and change in biological cognition.”
3.2 Conceptual Prototype of the Neural Tree
class NeuralTree:
    def __init__(self, hidden_size, initial_dimensions=5):
        # One branch per semantic dimension; Branch is assumed to wrap a
        # projection plus its own growable sub-branches.
        self.branches = {
            "subj": Branch(name="Subject", dim=hidden_size),
            "pred": Branch(name="Predicate", dim=hidden_size),
            "attr": Branch(name="Attribute", dim=hidden_size),
            "modal": Branch(name="Modality", dim=hidden_size),
            "context": Branch(name="Context", dim=hidden_size)
        }
        # Learnable inter-branch connections (routing weights between dimensions)
        self.connections = self._initialize_connections()
The model can now process language in parallel, restructure dynamically, and retain history as branching memory — a structure far closer to how humans manage thought and conversation.
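To make "grows and prunes dynamically" concrete, here is a rough sketch in which each branch carries a usage score, heavily used branches spawn sub-branches, and idle ones have their sub-branches pruned back (the root axes are kept). The Branch stand-in and both thresholds are assumptions.

class Branch:
    """A minimal stand-in for a Neural Tree branch with growable children."""
    def __init__(self, name, dim):
        self.name, self.dim = name, dim
        self.children = {}
        self.usage = 0.0          # running score of how often routing selects this branch

def grow_and_prune(tree, grow_threshold=0.8, prune_threshold=0.05):
    """One maintenance pass: split overloaded branches, trim unused ones."""
    for key in tree.branches:
        branch = tree.branches[key]
        if branch.usage > grow_threshold:
            # Heavily used dimension: spawn a sub-branch to refine it.
            sub_name = f"{branch.name}-sub{len(branch.children)}"
            branch.children[sub_name] = Branch(name=sub_name, dim=branch.dim)
        elif branch.usage < prune_threshold and branch.children:
            # Barely used dimension: prune its sub-branches, keep the root axis.
            branch.children.clear()
        branch.usage *= 0.9       # decay so scores reflect recent traffic

tree = type("Tree", (), {})()                      # minimal stand-in with a .branches dict
tree.branches = {"subj": Branch("Subject", dim=64)}
tree.branches["subj"].usage = 0.9
grow_and_prune(tree)
print(list(tree.branches["subj"].children))        # ['Subject-sub0']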
Chapter 4: Knowledge Storage — From Tensor to Database
4.1 Why Not Just Use a Database?
“Why not store model parameters in TerminusDB directly, instead of compressing them into giant tensors?”
This seemingly naive question challenges the core assumption of deep learning — that all knowledge must be encoded as parameters.
Advantages of DB-backed memory (a minimal storage sketch follows this list):
- Escapes parameter count limits
- Naturally expresses relationships and hierarchy
- Enables sparse and efficient knowledge retrieval
- Seamlessly integrates external knowledge graphs
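The question above names TerminusDB; as a neutral stand-in, the sketch below shows the same idea over SQLite, with knowledge stored as (subject, relation, object) rows rather than tensor entries. The schema and helper names are illustrative only.

import sqlite3

# Knowledge lives as rows, not as entries of a giant weight tensor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knowledge (subject TEXT, relation TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO knowledge VALUES (?, ?, ?)",
    [("transformer", "introduced_in", "2017"),
     ("multi_k", "decomposes", "K"),
     ("neural_tree", "extends", "multi_k")],
)

def recall(subject):
    """Sparse retrieval: pull only the rows related to the query subject."""
    return conn.execute(
        "SELECT relation, object FROM knowledge WHERE subject = ?", (subject,)
    ).fetchall()

print(recall("multi_k"))   # [('decomposes', 'K')]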
4.2 From Full Attention to Selective Query-Based Attention
Traditional attention scales as O(n²) in the sequence length n. By querying a database, we reduce this:
# Standard attention
attention_scores = softmax(Q @ K.T / sqrt(d))
outputs = attention_scores @ V
# Selective attention
relevant_keys = select_relevant_keys(Q, knowledge_graph)
sparse_attention = compute_attention(Q, relevant_keys)
outputs = aggregate(sparse_attention, corresponding_values)
This brings computation closer to O(n·k), where k is the number of relevant entries — often much smaller than n.
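A sketch of the O(n·k) pattern, with the knowledge-graph lookup replaced by a plain top-k similarity search over a stored key matrix. In practice the retrieval step would go through an index or graph query; the dense scoring here is only to keep the example self-contained.

import torch
import torch.nn.functional as F

def selective_attention(Q, K_store, V_store, k=32):
    """Attend only over the k stored entries most similar to each query."""
    d = Q.shape[-1]
    # Retrieval step: score every stored key once, keep the top-k per query.
    sims = Q @ K_store.T                                   # (n_queries, n_stored)
    topk = sims.topk(k=min(k, K_store.shape[0]), dim=-1)
    K_sel = K_store[topk.indices]                          # (n_queries, k, d)
    V_sel = V_store[topk.indices]
    # Attention restricted to the retrieved entries.
    scores = (Q.unsqueeze(1) * K_sel).sum(-1) / d ** 0.5   # (n_queries, k)
    weights = F.softmax(scores, dim=-1)
    return (weights.unsqueeze(-1) * V_sel).sum(dim=1)      # (n_queries, d)

Q = torch.randn(7, 64)
K_store, V_store = torch.randn(10_000, 64), torch.randn(10_000, 64)
print(selective_attention(Q, K_store, V_store).shape)      # torch.Size([7, 64])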
4.3 Text Distillation: A Minimalist Knowledge Transfer Method
Instead of parameter distillation, we use text distillation (a minimal sketch follows this list):
- Store knowledge directly as structured natural language
- Avoid heavy retraining or compression
- Archive model output as lightweight, interpretable memory units
“This is still distillation — just minimal, efficient, and explainable.”
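A minimal sketch of what a text-distilled memory unit might look like, assuming plain JSON files as the store; the unit fields (topic, claim, source, confidence) are an assumed schema, not a prescribed one.

import json, time
from pathlib import Path

MEMORY_DIR = Path("memory_units")
MEMORY_DIR.mkdir(exist_ok=True)

def distill_to_text(topic, claim, source_model, confidence=0.8):
    """Store knowledge as a structured natural-language unit instead of parameters."""
    unit = {
        "topic": topic,
        "claim": claim,                 # the knowledge itself, in plain language
        "source": source_model,         # which model produced it
        "confidence": confidence,
        "created": time.time(),
    }
    path = MEMORY_DIR / f"{topic}_{int(unit['created'])}.json"
    path.write_text(json.dumps(unit, indent=2))
    return path

distill_to_text("attention", "Attention cost grows quadratically with sequence length.",
                source_model="parent_llm")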
Chapter 5: Large Models Guiding Multi-K Training – Raising Neural Tree Children
5.1 Prompting LLMs to Define the K Dimensions
We leverage existing LLMs to help generate and refine the Multi-K structure via structured prompts:
def determine_multi_k_categories(llm_api):
    # Ask an existing LLM to propose the semantic axes for the Multi-K structure.
    prompt = """
    As an AI architecture expert, propose 5–7 core semantic dimensions for a 'Multi-K Transformer'.
    Each dimension should capture a distinct aspect of language and be processed in parallel.
    For each dimension, provide:
    1. Name (K_XXX)
    2. Functional description
    3. Target linguistic features
    4. One short example of usage
    """
    # llm_api is assumed to be a callable that takes a prompt string and returns text.
    return llm_api(prompt)
This allows the model itself to help structure its own semantic scaffolding — reducing design effort and enhancing alignment.
5.2 Neural Tree Children: A New Way to Grow AI
“It feels like using multiple experts to raise a child — one that understands thought as a tree, not a stream.”
Each Neural Tree Child:
- Inherits knowledge from large models
- Grows its own dynamic tree-shaped memory
- Thinks in dimensions from the start
class NeuralTreeChild:
    def __init__(self, parent_models, db_connection):
        self.parents = parent_models          # large "teacher" models that guide growth
        self.brain = db_connection            # database-backed memory (e.g. a graph store)
        # Ask the parents which semantic dimensions this child should start with,
        # then grow a Neural Tree around them.
        self.dimensions = self.discover_neural_dimensions()
        self.initialize_neural_tree()
This child isn’t a clone — it’s a growing mind with a unique semantic signature.
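As one reading of discover_neural_dimensions, the sketch below has the child ask each parent for candidate dimensions (for example via the prompt from Section 5.1) and keep only those that enough parents agree on; the voting rule and the propose_dimensions interface are assumptions.

from collections import Counter

def discover_neural_dimensions(parent_models, min_votes=2):
    """Keep the K-dimensions that enough parent models independently propose."""
    votes = Counter()
    for parent in parent_models:
        # Each parent is assumed to expose propose_dimensions() -> list of dimension names,
        # e.g. backed by the determine_multi_k_categories prompt from Section 5.1.
        for dim in parent.propose_dimensions():
            votes[dim] += 1
    return [dim for dim, count in votes.items() if count >= min_votes]

class FakeParent:
    """Stand-in for an LLM wrapper; a real parent would call the LLM prompt above."""
    def __init__(self, proposals):
        self._proposals = proposals
    def propose_dimensions(self):
        return self._proposals

parents = [FakeParent(["K_subj", "K_pred", "K_modal"]),
           FakeParent(["K_subj", "K_pred", "K_context"]),
           FakeParent(["K_subj", "K_attr", "K_pred"])]
print(discover_neural_dimensions(parents))   # ['K_subj', 'K_pred']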
Chapter 6: The Natural Growth Path of AGI
6.1 What Real Growth Might Look Like
“This feels like the correct posture for AGI to grow.”
Core traits of this growth path:
- A genuine cognitive structure — from layers to branches, from sequence to dimensions
- An organic learning mode — parallel growth across skills, not linear scaling
- Self-evolution — the system rewires itself through experience
6.2 Structural Paradigm Shifts for AGI
Key differences in this architecture:
- From parameters → to structure
- From training → to cultivation
- From monolithic models → to neural ecosystems
6.3 Future Directions
- Dynamic K generation — dimensions evolve with task requirements
- Cross-modal Multi-K — unifying text, vision, audio under shared field structure
- Recursive K structures — hierarchical sub-dimensions inside each K
- K-space navigation — advanced SRPF routing for semantic flow control
Closing Thoughts
From a single question —
“Can a Transformer turn K into a parallel structure?”
we arrive at a broader vision of AGI growth.
Multi-K, Neural Trees, SRPF, and Text Distillation are not just new techniques, but new philosophical stances:
on cognition, memory, and what it means to “understand”.
Perhaps the next phase of AI won’t be built by scaling up —
but by growing better, more organized, and more semantically attuned architectures.
We’re just beginning to explore that path.