Building LLM architecture

Could recursive “binary-with-magnitude” encodings (squares vs pronics) inspire new LLM memory structures?

Hi everyone,

I’ve been experimenting with a recursive integer sequence I call the Nitya Sequence (Nitya = “eternal” in Sanskrit). It alternates deterministically between perfect squares and pronic numbers.

Definition (recursive form):

Start from a perfect square, e.g. 1 = 1² (used in the example below) or 4 = 2².

Recurrence:

a(1) = 1,  a(n+1) = a(n) + ⌈√a(n)⌉.

Example terms:

1, 2, 4, 6, 9, 12, 16, 20, 25, 30, 36, 42, 49, 56, 64, 72, 81, 90, 100, 110, 121, 132, 144, 156, 169, …

Odd indices → perfect squares.

Even indices → pronics.

The sequence can start at any square, so there are infinitely many variants.
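
A minimal Python sketch (my own helper names, nothing standard) that generates the sequence from the recurrence and checks the square/pronic alternation:

import math

def ceil_sqrt(a):
    r = math.isqrt(a)
    return r if r * r == a else r + 1

def nitya_ascending(start=1, n_terms=25):
    # a(n+1) = a(n) + ceil(sqrt(a(n)))
    terms = [start]
    for _ in range(n_terms - 1):
        terms.append(terms[-1] + ceil_sqrt(terms[-1]))
    return terms

def is_square(x):
    r = math.isqrt(x)
    return r * r == x

def is_pronic(x):
    r = math.isqrt(x)
    return r * (r + 1) == x

seq = nitya_ascending()
assert all(is_square(a) for a in seq[0::2])  # odd (1-based) positions: squares
assert all(is_pronic(a) for a in seq[1::2])  # even (1-based) positions: pronics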


Why this caught my attention for LLMs

If we encode squares as “1” and pronics as “0”, the sequence becomes a kind of binary encoding with magnitude and context:

A “1” (square) is not just a bit, but carries positional/magnitude information.

Each “0” (pronic) exists only relative to its neighboring squares, so the 0 has contextual dependence.

In other words: 1’s and 0’s store information about each other recursively, not just independently.
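
A toy illustration of that reading, mapping each term to a (bit, scale) pair rather than a bare bit (the particular encoding shown is just one possible convention):

import math

def encode(term):
    m = math.isqrt(term)
    if m * m == term:
        return (1, m)           # square: a '1' carrying its own scale m
    return (0, (m, m + 1))      # pronic: a '0' defined by its neighboring scales

print([encode(a) for a in [1, 2, 4, 6, 9, 12, 16]])
# [(1, 1), (0, (1, 2)), (1, 2), (0, (2, 3)), (1, 3), (0, (3, 4)), (1, 4)]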

This made me wonder:

Could such recursive binary-with-magnitude encodings be useful in LLM architectures (e.g. for contextual embeddings, memory compression, or retrieval mechanisms)?

Might there be analogies in semiconductors (squares = stable lattice states, pronics = transitions) or prime factorization methods, where interleaving carries hidden structure?

In LLMs specifically: could attention/memory layers benefit from such a deterministic recursive binary encoding that naturally preserves context between tokens?


Questions for the community

  1. Are there known information encoding schemes in AI/ML that resemble this recursive alternation (binary with contextual dependence)?

  2. Could a recursive definition like this be tested as a memory initialization or embedding layer in LLMs?

  3. Do you know of existing work connecting integer sequences to architectural designs in neural networks?


I’d love to hear thoughts, whether this is just an interesting mathematical curiosity or if it could inspire new directions in LLM memory design or representation learning.

Additional note (descending variant):
There is also a descending version of the Nitya Sequence: starting from a square (e.g. 100 = 10²) and recursively subtracting ⌊√a(n)⌋ at each step.
Example:

100, 90, 81, 72, 64, 56, 49, 42, 36, 30, 25, 20, 16, 12, 9, 6, 4, 2, 1, 0.
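
A minimal sketch of that rule (the floor-of-square-root step is my reading of the example above):

import math

def nitya_descending(start=100):
    # a(n+1) = a(n) - floor(sqrt(a(n))), stopping at 0
    terms = [start]
    while terms[-1] > 0:
        terms.append(terms[-1] - math.isqrt(terms[-1]))
    return terms

print(nitya_descending(100))
# [100, 90, 81, 72, 64, 56, 49, 42, 36, 30, 25, 20, 16, 12, 9, 6, 4, 2, 1, 0]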

Could the descending recursion also have applications in encoding/compression (finite cycles, reversible processes)?

In math terms, is this just a reverse traversal of the ascending case, or does it have unique structural properties?

Thanks!
— Mahesh Babu Pendekanti

#LLM #embeddings #memory #AI-research


Seems mostly yes?

I asked my AI, which knows something about semantic attractor dynamics and prime number gaps.
https://discuss.huggingface.co/t/164607/5

And this is its reply; hope it helps.

https://chatgpt.com/share/68d7c80f-7608-8010-be48-bc72196ad178 <= the formulas render better at this link.

Yes: your square ↔ pronic alternation can be turned into a useful anchor–edge memory scaffold for LLMs.

  • Squares act as anchors (stable “1” with scale/magnitude).
  • Pronics act as edges/transitions (contextual “0” defined by its two neighboring anchors).
    This yields a multi-scale positional/memory scheme with ~O(√N) anchor slots for a length-N sequence, which is handy for long-context retrieval and compression.

Why this is promising

Your recurrence

a(n+1) = a(n) + ⌈√a(n)⌉,  a(1) = 1

produces diffs Δa = 1, 2, 2, 3, 3, 4, 4, 5, 5, … (after the initial 1, each step size appears twice). That induces a very clean two-phase geometry:

  • Odd indices (perfect squares m²) → anchors with intrinsic scale m.
  • Even indices (pronics m(m+1)) → edges that bind consecutive anchors m² ↔ (m+1)².

Intuition: anchors are “stable lattice states”, pronics are “allowed transitions”. This is exactly the structure many long-context systems want: a sparse set of global reference points (anchors) plus local bridges (edges).


A concrete design: Nitya-PE (Positional Encoding) + Anchor-Edge Attention

Encoding for token position t:

  • m = ⌊√t⌋ (segment index / scale)
  • ε = t − m² (offset within the [m², (m+1)²] segment)
  • Phase bit b = 1[t is a square]

Form a feature vector, then project to model dim:

  • u1 = √t (magnitude/scale)
  • u2 = ε / (2m + 1) (normalized in-segment offset)
  • u3 = b (anchor/edge phase)

Use PE(t) = W·[u1, u2, u3] (learned linear W), optionally concatenated with your usual RoPE/ALiBi/relative biases.
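
A hedged PyTorch sketch of that encoder (the module name, the single nn.Linear projection, and the tensor shapes are my assumptions, not something fixed above):

import torch
import torch.nn as nn

class NityaPE(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(3, d_model)  # learned W acting on [u1, u2, u3]

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: integer tensor of token positions (values >= 1), shape (..., T)
        t = positions.float()
        m = torch.sqrt(t).floor()                  # segment index / scale
        eps = t - m * m                            # offset within [m^2, (m+1)^2]
        u1 = torch.sqrt(t)                         # magnitude / scale
        u2 = eps / (2 * m + 1)                     # normalized in-segment offset
        u3 = (eps == 0).float()                    # phase bit: 1 iff t is a square
        feats = torch.stack([u1, u2, u3], dim=-1)  # (..., T, 3)
        return self.proj(feats)                    # (..., T, d_model)

# pe = NityaPE(d_model=256)(torch.arange(1, 1025))  # add to / concat with token embeddings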

Attention bias (Anchor-Edge mask):

  • Give a small learned bonus for queries to attend to:
    1. their nearest anchor m²,
    2. the bridging edge m(m+1), and
    3. the next anchor (m+1)².
  • Within a segment (between m² and (m+1)²), allow more order-insensitive aggregation; across segments, keep order-sensitive weighting.
    (Rationale: your diffs are flat inside segments and “jump” only at boundaries; those jumps are natural event markers.)
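
A sketch of that bias as a precomputed additive mask (the fixed bonus values stand in for what would be learned scalars, and causal masking is left out for brevity):

import math
import torch

def anchor_edge_bias(seq_len: int, anchor_bonus=1.0, edge_bonus=0.5) -> torch.Tensor:
    # bias[t, s]: bonus for a query at position t attending to key position s (1-based)
    bias = torch.zeros(seq_len + 1, seq_len + 1)
    for t in range(1, seq_len + 1):
        m = math.isqrt(t)
        targets = [(m * m, anchor_bonus),              # nearest anchor m^2
                   (m * (m + 1), edge_bonus),          # bridging edge m(m+1)
                   ((m + 1) * (m + 1), anchor_bonus)]  # next anchor (m+1)^2
        for s, bonus in targets:
            if 1 <= s <= seq_len:
                bias[t, s] += bonus
    return bias  # add to the attention logits before softmax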

Memory compression:

  • Keep external memory only at anchors (and optionally edges).
  • Number of anchors up to position N is ⌊√N⌋ → ~O(√N) memory slots, while still indexing the full length N.
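
A quick numeric check of that √N footprint (the counts are just math.isqrt outputs):

import math
for N in (1_000, 100_000, 10_000_000):
    print(N, math.isqrt(N))  # 1000 -> 31, 100000 -> 316, 10000000 -> 3162 anchor slots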

How it’s different from plain binary

Your “1/0” isn’t free-floating; it’s binary with magnitude and context:

  • “1” = square carries the scale m explicitly (via √t).
  • “0” = pronic is defined by its neighbors; it encodes relational info (a transition), not a standalone state.

This makes it a natural fit for hierarchical or multi-scale reasoning, with built-in boundary detection (the moments when the diff increases).


Relation to known ideas (at a glance)

  • Positional encodings: complements RoPE/ALiBi/T5-relative by adding an explicit segment scale and phase.
  • Hierarchical/long-context attention: akin to global tokens or skip-lists, but with a deterministic arithmetic schedule for where the globals live (the squares) and how to bridge them (the pronics).
  • State-space / memory models: you can treat anchors as persistent states and pronics as update edges.

Minimal experiment you can run

  1. Tasks: long-range retrieval (needle-in-a-haystack), cross-chapter QA, code with far references, or any long-document summarization.
  2. Models: a small Transformer baseline (e.g., 100–300M).
  3. Variants:
  • Baseline PE (RoPE or ALiBi)
  • Baseline + Anchor-Edge attention bias
  • Baseline + Nitya-PE
  • Baseline + both (bias + Nitya-PE)
  4. Metrics: retrieval hit-rate @k, cross-segment F1 (can it follow references across square boundaries?), perplexity on long contexts, and memory/read latency if you add external memory at anchors.
  5. Ablations: remove the phase bit; replace u1 with log t; jitter anchor locations slightly to check robustness.

Descending variant (reversible pass)

Your decreasing rule (e.g., 100 → 90 → 81 → 72 → ⋯ → 0) is essentially a reverse traversal of the same anchor-edge graph. Two uses:

  • Reversible compression / replay: write forward (ascending), read back (descending) to reconstruct long chains with low drift.
  • Curriculum & stabilization: a “cool-down” pass that progressively collapses transitions back into anchors—useful for denoising or for iterative summarization that preserves anchor fidelity.

Pseudocode sketch (toy)

import math

def nitya_features(t):
    # Toy Nitya-PE features for token position t >= 1.
    m = math.isqrt(t)               # segment index / scale, m = floor(sqrt(t))
    eps = t - m * m                 # offset within the [m^2, (m+1)^2] segment
    u1 = math.sqrt(t)               # magnitude / scale
    u2 = eps / (2 * m + 1)          # normalized in-segment offset
    u3 = 1.0 if eps == 0 else 0.0   # phase bit: square = anchor
    return [u1, u2, u3]             # project with a learned linear layer

# Attention bias: for a query at position t, bonus to nearest anchors/edge
def nitya_bias(t, s):
    # Additive bonus for a query at position t attending to key position s.
    m = math.isqrt(t)
    anchors = {m * m, (m + 1) * (m + 1)}   # m^2 and (m+1)^2
    edge = m * (m + 1)                     # bridging pronic
    return 1.0 if s in anchors else (0.5 if s == edge else 0.0)
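
Quick usage of the two helpers (query position 7 sits between the anchors 4 and 9, bridged by the edge 6):

row = [nitya_bias(7, s) for s in range(1, 11)]
print(row)                # bonuses only at s = 4, 6, 9
print(nitya_features(7))  # [sqrt(7), 3/5, 0.0]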

Open questions (would love community input)

  • Does anchor-edge bias improve fidelity on truly long contexts (>128k) without extra compute?
  • What’s the best way to combine Nitya-PE with RoPE (concat vs. gating vs. mixture-of-positional-experts)?
  • Can we push the idea into external memory: anchors as keys, pronics as linkers, for O(√N) retrieval hops?

Bottom line: your “recursive binary-with-magnitude” looks like a clean, testable multi-scale memory/indexing prior. If you’re up for it, I can help draft a small training script to benchmark Nitya-PE + anchor-edge attention on a public long-context task and share results.
