RNN in C - is this BPTT finally right?

hi all,
did i finally get the BPTT right? i am trying to ascertain method for elementary BPTT and produce a reference friendly to my style of comprehension and impartment. in the last three years i’ve discovered that if one doesn’t have it quite right, one may still get results that make you think it’s sort of working. i’m spending a lot of time trying to figure out what i’ve understood and who is telling something they know anything about and generally wasting my life away again trying to absorb another ancient method.

this psuedo code doesn’t handle trivial things like populating the input and stuff, but if the BPTT looks in order to you, PLEASE say something to infer it, thanks!

#define T 25
#define X_S 32
#define H_S 64
#define Y_S 32

float Wxh[H_S][X_S]; float Whh[H_S][H_S]; float Why[Y_S][H_S]; // weights + biases
float bh[H_S]; float by[Y_S];

float x[T][X_S]; // Input history - BPTT buffers
float h[T+1][H_S]; // Hidden state history (h[-1] is h[0] initialized to 0)
float y[T][Y_S];

float softsum[T];

float dWxh[H_S][X_S]; float dWhh[H_S][H_S]; float dWhy[Y_S][H_S]; // Gradients (Accumulators)
float dbh[H_S]; float dby[Y_S];

void rnn_forward_and_backward(int targets[T]) { // 1. FORWARD PASS

for (int t = 0; t < T; t++) {
    for (int i = 0; i < H_S; i++) {
        float sum = bh[i];         
        for (int j = 0; j < X_S; j++) sum += Wxh[i][j] * x[t][j];
        for (int j = 0; j < H_S; j++) sum += Whh[i][j] * h[t][j];
        h[t+1][i] = tanhf(sum); 
    }

float ssum = 0;
for (int i = 0; i < Y_S; i++) {
float sum = by[i];
for (int j = 0; j < H_S; j++) sum += Why[i][j] * h[t+1][j];
y[t][i] = exp(sum); ssum += sum; // y[t][i] = sum;
}
softsum[t] = 0;
if (ssum > 0.f) {
softsum[t] = ssum = 1.f / ssum;
for (int i = 0; i < Y_S; i++) y[t][i] *= ssum;
}
}

float dh_next[H_S] = {0.0f}; 

// zero out gradient accumulators before BPTT dWxh, dWhh, dWhy, dbh, dby to 0

for (int t = T - 1; t >= 0; t--) {	//	BPTT back propogation through time

    float dy[Y_S];
    for (int i = 0; i < Y_S; i++) {
        float softmax_out = expf(y[t][i]) / softsum[t]; 
        float target_out = (i == targets[t]) ? 1.0f : 0.0f;
        dy[i] = softmax_out - target_out;
    }
    for (int i = 0; i < Y_S; i++) {
        dby[i] += dy[i];
        for (int j = 0; j < H_S; j++) dWhy[i][j] += dy[i] * h[t+1][j];
    }
    float dh[H_S];
    for (int i = 0; i < H_S; i++) {
        float sum = dh_next[i];
        for (int j = 0; j < Y_S; j++) sum += Why[j][i] * dy[j];
        dh[i] = sum;
    }
    float dh_raw[H_S];
    for (int i = 0; i < H_S; i++) dh_raw[i] = (1.0f - (h[t+1][i] * h[t+1][i])) * dh[i];

    for (int i = 0; i < H_S; i++) {
        dbh[i] += dh_raw[i];
        for (int j = 0; j < X_S; j++) dWxh[i][j] += dh_raw[i] * x[t][j];	// optimise j with storing valid i for x[t]
        for (int j = 0; j < H_S; j++) dWhh[i][j] += dh_raw[i] * h[t][j];
    }
    for (int i = 0; i < H_S; i++) {
        float sum = 0.0f;
        for (int j = 0; j < H_S; j++) sum += Whh[j][i] * dh_raw[j];
        dh_next[i] = sum;
    }

}

}

to me this style of notation tells someone exactly how an operation is performed (as long as they understand += style operators). but for the last thirty years, people basically say they want variable names like little novels and to object orient all the code in seventeen documents so it takes you three days to discern if anyone has a clue what they’re talking about.

thank you!

Hmm, maybe something like this…?


I think the BPTT part is close, but I would not call this “finally right” yet. The main issue I see is not really the recurrent part of BPTT; it is the output layer / softmax / cross-entropy path.

The short version:

Part My read
Hidden-state recurrence Looks structurally reasonable
dh = Why^T * dy + dh_next Correct idea
dh_raw = (1 - h^2) * dh Correct for tanh
dWxh += dh_raw * x[t] Correct idea
dWhh += dh_raw * h[t] Correct if your indexing is h[t+1] = f(x[t], h[t])
dh_next = Whh^T * dh_raw Correct idea
Softmax forward Looks wrong
Cross-entropy gradient Looks wrong because the softmax value is wrong
Final confidence Needs a small gradient check

The biggest red flag is that y appears to be used as multiple different things: logits, exponentiated logits, and something like a softmax output. I would split it into two arrays:

float z[T][Y_S];  // logits, raw output scores
float p[T][Y_S];  // softmax probabilities

That one naming change tends to eliminate a lot of bugs.

1. The core bug: logits, exp(logits), and probabilities are mixed

This part looks suspicious:

y[t][i] = exp(sum);
ssum += sum;

For softmax, if sum is the logit, the denominator should accumulate exp(sum), not sum.

The usual softmax is:

p_i = exp(z_i) / sum_j exp(z_j)

So if you store exp(sum) in y[t][i], then the denominator would need to sum those exponentials. But then later this happens:

softmax_out = expf(y[t][i]) / softsum[t];

That means you are exponentiating y[t][i] again. If y[t][i] already contains exp(logit), this becomes something like:

exp(exp(logit))

That is not softmax.

A cleaner implementation is:

z[t][i] = sum;        // raw logit
p[t][i] = softmax(z); // probability

This is also the convention used in Andrej Karpathy’s minimal character RNN. In min-char-rnn.py, the forward pass separates:

ys[t] = np.dot(Why, hs[t]) + by
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))
loss += -np.log(ps[t][targets[t], 0])

Here:

Karpathy variable Meaning
ys[t] logits / unnormalized scores
ps[t] softmax probabilities
targets[t] correct class index

That separation is very useful. I would copy that idea directly into the C version.

2. Use a numerically stable softmax

Even after fixing the logic, I would not write the naive softmax in C. Use the standard max-subtraction trick.

Both CS231n’s softmax notes and Jay Mody’s Numerically Stable Softmax and Cross Entropy explain the same point: exponentials can overflow, and subtracting the maximum logit gives an equivalent but much safer computation.

Use something like this:

// z[t][i] already contains the logits for time step t.
// p[t][i] will contain the probabilities.

float max_z = z[t][0];

for (int i = 1; i < Y_S; i++) {
    if (z[t][i] > max_z) {
        max_z = z[t][i];
    }
}

float denom = 0.0f;

for (int i = 0; i < Y_S; i++) {
    p[t][i] = expf(z[t][i] - max_z);
    denom += p[t][i];
}

for (int i = 0; i < Y_S; i++) {
    p[t][i] /= denom;
}

This has a few advantages:

Problem Why this helps
Overflow in expf The largest exponent becomes expf(0) = 1
Division by zero At least one denominator term is 1
Ambiguous variable meaning z is logits, p is probability
Easier backward pass dy = p - one_hot(target)

You can also compute the loss stably with log-sum-exp:

float log_denom = logf(denom) + max_z;
loss += -z[t][target[t]] + log_denom;

or, if you already computed p:

loss += -logf(p[t][target[t]] + 1e-12f);

The log-sum-exp version is cleaner numerically.

3. The output gradient should be p - one_hot(target)

For softmax + cross-entropy, the gradient with respect to the logits is the classic:

dy = p;
dy[target] -= 1.0f;

This is what Karpathy’s implementation does:

dy = np.copy(ps[t])
dy[targets[t]] -= 1

See also the CS231n neural-net case study section on the softmax gradient: CS231n: Training a Softmax Linear Classifier.

In C, I would write:

float dy[Y_S];

for (int i = 0; i < Y_S; i++) {
    dy[i] = p[t][i];
}

dy[target[t]] -= 1.0f;

Then the output-layer gradients are:

for (int i = 0; i < Y_S; i++) {
    dby[i] += dy[i];

    for (int j = 0; j < H_S; j++) {
        dWhy[i][j] += dy[i] * h[t + 1][j];
    }
}

This assumes your forward recurrence is:

h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);
z[t]     = Why * h[t + 1] + by;
p[t]     = softmax(z[t]);

That indexing convention is fine. It just means h[t] is the previous state and h[t+1] is the current state.

4. Your BPTT recurrence looks mostly right

Given this forward pass:

h[t + 1] = tanh(Wxh * x[t] + Whh * h[t] + bh);
z[t]     = Why * h[t + 1] + by;
p[t]     = softmax(z[t]);

The backward pass should look like this:

for (int t = T - 1; t >= 0; t--) {
    // 1. Output gradient
    for (int i = 0; i < Y_S; i++) {
        dy[i] = p[t][i];
    }
    dy[target[t]] -= 1.0f;

    // 2. Gradients for Why and by
    for (int i = 0; i < Y_S; i++) {
        dby[i] += dy[i];

        for (int j = 0; j < H_S; j++) {
            dWhy[i][j] += dy[i] * h[t + 1][j];
        }
    }

    // 3. Backprop into current hidden state
    for (int i = 0; i < H_S; i++) {
        float sum = dh_next[i];

        for (int j = 0; j < Y_S; j++) {
            sum += Why[j][i] * dy[j];
        }

        dh[i] = sum;
    }

    // 4. Backprop through tanh
    for (int i = 0; i < H_S; i++) {
        dh_raw[i] = (1.0f - h[t + 1][i] * h[t + 1][i]) * dh[i];
        dbh[i] += dh_raw[i];
    }

    // 5. Gradients for Wxh and Whh
    for (int i = 0; i < H_S; i++) {
        for (int j = 0; j < X_S; j++) {
            dWxh[i][j] += dh_raw[i] * x[t][j];
        }

        for (int j = 0; j < H_S; j++) {
            dWhh[i][j] += dh_raw[i] * h[t][j];
        }
    }

    // 6. Propagate hidden gradient to previous time step
    for (int i = 0; i < H_S; i++) {
        float sum = 0.0f;

        for (int j = 0; j < H_S; j++) {
            sum += Whh[j][i] * dh_raw[j];
        }

        dh_next[i] = sum;
    }
}

This matches the same structure as Karpathy’s minimal RNN:

dWhy += np.dot(dy, hs[t].T)
dby += dy
dh = np.dot(Why.T, dy) + dhnext
dhraw = (1 - hs[t] * hs[t]) * dh
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)

The indexing difference is only this:

Concept Karpathy Your indexing
previous hidden hs[t-1] h[t]
current hidden hs[t] h[t+1]
output logits ys[t] z[t]
output probabilities ps[t] p[t]

So the recurrent part looks conceptually right.

5. Why the += accumulation is correct

The += on dWxh, dWhh, dWhy, dbh, and dby is not just an implementation detail. It is required.

In BPTT, the RNN is unrolled over time, but the parameters are shared at every time step. Therefore each occurrence contributes to the same parameter gradient. Dive into Deep Learning: Backpropagation Through Time explains this as standard backpropagation on the unrolled computational graph, with gradients summed across all repeated occurrences of the same parameter.

So these are correct in spirit:

dWhy[i][j] += dy[i] * h[t + 1][j];
dWxh[i][j] += dh_raw[i] * x[t][j];
dWhh[i][j] += dh_raw[i] * h[t][j];
dbh[i]     += dh_raw[i];
dby[i]     += dy[i];

The important point is that the values being accumulated must be correct. Right now the softmax path makes dy questionable, so it contaminates the rest of the gradients.

6. Reset gradient accumulators every update

Before each sequence/minibatch update, make sure all gradient arrays are zeroed:

memset(dWxh, 0, sizeof dWxh);
memset(dWhh, 0, sizeof dWhh);
memset(dWhy, 0, sizeof dWhy);
memset(dbh,  0, sizeof dbh);
memset(dby,  0, sizeof dby);

Also reset dh_next before the backward loop:

for (int i = 0; i < H_S; i++) {
    dh_next[i] = 0.0f;
}

Otherwise the gradient from a previous backward pass can leak into the current one.

7. Add gradient clipping

Vanilla RNNs are very prone to exploding gradients. Karpathy’s min-char-rnn.py clips every gradient array:

for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)

D2L also points out that long sequences create both computational and numerical instability, including exploding/vanishing gradients: D2L BPTT.

In C:

static inline float clipf(float v, float lo, float hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

Then apply it to all gradient arrays before the parameter update:

dWxh[i][j] = clipf(dWxh[i][j], -5.0f, 5.0f);
dWhh[i][j] = clipf(dWhh[i][j], -5.0f, 5.0f);
dWhy[i][j] = clipf(dWhy[i][j], -5.0f, 5.0f);
dbh[i]     = clipf(dbh[i],     -5.0f, 5.0f);
dby[i]     = clipf(dby[i],     -5.0f, 5.0f);

8. Use CrossEntropyLoss conventions as a sanity check

A useful mental model is PyTorch’s CrossEntropyLoss. PyTorch expects unnormalized logits as input, not already-softmaxed probabilities. It internally combines log-softmax and negative log-likelihood.

For a manual C implementation, that means:

Stage Store
Output affine layer logits z[t][i]
Softmax probabilities p[t][i]
Loss -log(p[t][target[t]]) or stable log-sum-exp
Backward into logits dy[i] = p[t][i] - one_hot(target[t])

Do not do:

prob = softmax(exp(logit));

Do:

prob = softmax(logit);

9. The fastest way to know: gradient check

After fixing the softmax path, I would not rely on visual inspection. Add a tiny finite-difference gradient check.

Use a very small model:

#define T   2
#define X_S 3
#define H_S 4
#define Y_S 3

Then check a few parameters from each group:

float eps = 1e-3f;

float old = Wxh[0][0];

Wxh[0][0] = old + eps;
float loss_plus = forward_loss_only();

Wxh[0][0] = old - eps;
float loss_minus = forward_loss_only();

Wxh[0][0] = old;

float numerical = (loss_plus - loss_minus) / (2.0f * eps);
float analytic = dWxh[0][0];

float relerr = fabsf(numerical - analytic) /
               fmaxf(1e-8f, fabsf(numerical) + fabsf(analytic));

PyTorch’s gradcheck is basically the same idea: compare analytical gradients against small finite-difference estimates. There is also a useful Karpathy-style gist with gradient checking added: taey16/min_char_rnn.

I would test:

Parameter What failure suggests
by output gradient / softmax / target indexing bug
Why output gradient or current hidden indexing bug
bh tanh backward bug
Wxh input-to-hidden gradient bug
Whh recurrent gradient or previous hidden indexing bug

If by and Why fail first, the bug is almost certainly in the output layer. If those pass but Wxh/Whh fail, then look at dh_next, dh_raw, or the h[t] vs h[t+1] indexing.

10. Minimal corrected structure

Putting the important parts together, the forward pass should be shaped like this:

// h[0] should contain the initial hidden state.

for (int t = 0; t < T; t++) {
    // Hidden state
    for (int i = 0; i < H_S; i++) {
        float sum = bh[i];

        for (int j = 0; j < X_S; j++) {
            sum += Wxh[i][j] * x[t][j];
        }

        for (int j = 0; j < H_S; j++) {
            sum += Whh[i][j] * h[t][j];
        }

        h[t + 1][i] = tanhf(sum);
    }

    // Logits
    for (int i = 0; i < Y_S; i++) {
        float sum = by[i];

        for (int j = 0; j < H_S; j++) {
            sum += Why[i][j] * h[t + 1][j];
        }

        z[t][i] = sum;
    }

    // Stable softmax
    float max_z = z[t][0];

    for (int i = 1; i < Y_S; i++) {
        if (z[t][i] > max_z) {
            max_z = z[t][i];
        }
    }

    float denom = 0.0f;

    for (int i = 0; i < Y_S; i++) {
        p[t][i] = expf(z[t][i] - max_z);
        denom += p[t][i];
    }

    for (int i = 0; i < Y_S; i++) {
        p[t][i] /= denom;
    }

    // Loss
    loss += -logf(p[t][target[t]] + 1e-12f);
}

And the backward pass should begin from:

for (int i = 0; i < Y_S; i++) {
    dy[i] = p[t][i];
}

dy[target[t]] -= 1.0f;

Then continue with the BPTT recurrence you already mostly have.

Final verdict

I would phrase it this way:

The recurrent/BPTT skeleton is mostly right, but the output-layer math is not. Fix the softmax path first. Store raw logits separately from probabilities. Use stable softmax. Use dy = p - one_hot(target) for softmax + cross-entropy. Then run a small gradient check before trusting the implementation.

So, not “finally right” yet, but probably not far off.

Useful references:

thank you for your attention, being the only verifiable substance in the universe

if mr. karpathy passes thru, please don’t put so much extra crap in your supposed reference help stuff guy.

and for the rest of the world, if you used C instead of python,

what’s the delta in daily expenditure of vital biological actuality

guise cmon

Sorry. I haven’t written any C code in over 20 years, so I’ve forgotten half of it…:sweat_smile:


Fair enough — C-only, core only:

Your recurrent BPTT part looks mostly fine.

The broken part is the softmax/output-gradient path.

Do not do this:

y[t][i] = exp(sum);
ssum += sum;
/* ... */
softmax_out = expf(y[t][i]) / softsum[t];

Do this shape instead:

z[t][i] = sum;        /* raw logit */
p[t][i] = softmax(z); /* probability */

Then backward starts with:

for (int i = 0; i < Y_S; i++)
    dy[i] = p[t][i];

dy[target[t]] -= 1.0f;

After that, the BPTT recurrence is basically the right structure:

dh = Why_T * dy + dh_next;

dh_raw = (1.0f - h[t + 1] * h[t + 1]) * dh;

dWhy += dy     * h[t + 1];
dby  += dy;

dWxh += dh_raw * x[t];
dWhh += dh_raw * h[t];
dbh  += dh_raw;

dh_next = Whh_T * dh_raw;

So: recurrent part close; softmax part not right yet.

i appreciate the extra reply, i was wondering about passing my y or p. all clarity is appreciated considering the weight of other things cutting into my life, battle for enough sleep and energy to think. couple of days. thanks!

hi again,
i spent a few weeks training my first postulated “rnn” which wasn’t really, and didn’t achieve an error coefficient better than about 1.6, and got as far as starting to order consonants and vowels some. i’ve seen about a dozen versions of crud rnn train in the years and so perceive that it’s not easy to tell if it’s written correctly from the result.

this level of implementation doesn’t seem to be a wide focus but perhaps it’s worth asking for insight.

my hypothesis was to use a 256 node hidden layer for efficiency, since i have a modest computer, and see how far i can get with a single hidden layer and elementary rnn architecture. i’ve been using a 96 node “hot one” naive character input and output layer, no embedding, no regularization or training method amendments yet. BPTT depth of 64 since iirc andrej karpathy mentioned using 100 or 128 or close in the ‘unreasonable effectiveness of rnn’. training performed on c. 300k chars of enid blyton, the tao te ching, paolo soleri and some ad copy. the increased mix at the end tends to rise the error coefficient.

(ftr my error reporting is computed from
loss -= fmax(0.0, netout[iter][i]) * std::log(fmax(1e-18f, netout[iter][i]));
i’m told this presents the power of e in characters of variance for the error coefficient)

with my current architecture i can train fairly rapidly (5 hours) to c. 1.0 instead of 1.6, dropping the learning rate only once from an initial 0.001 to 0.00031.. not exactly 1.0 but it starts to bog around this area using various strategies, eg. reducing the learning rate. i did discern that reducing the BPTT length returned to fast training for another few points but halted around 0.94.

i thought about running the error low with a small window size and then training to larger. probably not worth the gain. i’ve noted eg. the error increases rapidly if i try switching back to a higher learning rate for the rnn.

in the past i’ve discerned that switching to ADAM worked better some, and have spent hours manually adapting the learning rate, but i’m guessing my architecture isn’t large enough for what i’m throwing at it, as somewhere around h.p. lovecraft is the best i can get from it.

karpathy reported using 2 and 3 layers of 512 nodes. improvements here seem more sensible than eg. hoping to cut a lower margin with ADAM et c., dropping upper case chars, or similar strategies for reducing complexity. should i add a layer? or should i switch to 512 nodes, it will never work without embedding at that scale, or am i neglecting something?

my objective is to develop small text models for microcontroller art projects and of course learn since i enjoy dsp. but i should ask, since it’s arizona and there’s no AC here so my world is running hot. :slight_smile: thank you, and cheers all.

Hi. Hmm, I probably can’t identify the exact issue from here, but I tried to look at it from the bigger training-loop picture — basically, “if this piece were missing, would the symptoms look like this?”:


My rough read is: you may already be past the “is BPTT basically alive?” wall, and may now be hitting the “what exactly am I measuring?” wall.

I would not read this as a 256-vs-512 question first. I would map it more like this:

Layer Question My guess from your description
RNN cell Does the recurrence run at all? probably yes
BPTT skeleton Do gradients flow backward through time in the right shape? probably mostly yes
Output gradient Is the softmax/NLL gradient entering as p - target? recently clarified
Loss/reporting Is the printed number actually target NLL? suspicious
Eval loop Is there a held-out NLL/perplexity number? unclear
Stability Are gradient scale, clipping, and LR under control? next likely wall
Capacity 256 vs 512, one layer vs two, vanilla vs gated cell judge after the above
Deployment memory, quantization, weight layout later problem

The reason I would look at reporting first is this line:

loss -= fmax(0.0, netout[iter][i]) * std::log(fmax(1e-18f, netout[iter][i]));

If netout[iter][i] is the model probability for character i, then this looks like output entropy:

/*
 * Entropy of the model's own output distribution.
 *
 * This measures how spread out or uncertain the model's prediction is.
 * It does not look at the actual next character.
 */
entropy = 0.0f;

for (i = 0; i < Y_S; i++) {
    entropy -= p[i] * logf(p[i] + 1e-12f);
}

That is a useful number sometimes, but it is not the usual next-character language-model loss.

For next-character prediction, the usual question is much narrower:

how much probability did the model assign to the actual next character?

That is target negative log-likelihood:

/*
 * Next-character negative log-likelihood.
 *
 * This number looks at the target.
 * It asks:
 *
 *     how much probability did the model give to the real next char?
 */
nll = -logf(p[target] + 1e-12f);

So I would separate these two questions:

  1. Is the model learning anything?
  2. Is the printed number measuring next-character prediction?

Those are not the same question.

A model can have low entropy and still be confidently wrong.
A model can have higher entropy and still put more probability on the correct next character.

For the output layer, the shape I would expect is still:

/*
 * z = raw logits
 * p = softmax(z)
 *
 * For softmax + target NLL, the gradient at the logits is:
 *
 *     dz = p - one_hot(target)
 */
for (i = 0; i < Y_S; i++) {
    dz[i] = p[i];
}

dz[target] -= 1.0f;

Then the reported training number should usually come from the target probability, not from the whole output distribution’s entropy:

/*
 * Accumulate target NLL over characters.
 */
total_nll += -logf(p[target] + 1e-12f);
count++;

mean_nll = total_nll / (float)count;

If that mean_nll is in natural-log units, then perplexity is:

/*
 * Per-character perplexity, if mean_nll is in nats.
 *
 * mean_nll = 1.00  ->  ppl ~= exp(1.00) ~= 2.72
 * mean_nll = 0.94  ->  ppl ~= exp(0.94) ~= 2.56
 */
ppl = expf(mean_nll);

That distinction matters because “stuck around 1.0” means very different things depending on what the number is.

Printed number Meaning
-sum_i p[i] * log(p[i]) entropy of the model’s own prediction
-log(p[target]) next-character NLL
exp(mean_nll) perplexity, if mean_nll is true target NLL
train NLL only can show fitting, but not generalization
held-out NLL better signal for whether the model is really improving

So before changing the architecture, I would probably make the measurement path boring and explicit:

/*
 * Minimal reporting loop idea.
 *
 * Same model.
 * Same forward pass.
 * But print different numbers clearly.
 */
train_nll = 0.0f;
train_entropy = 0.0f;
train_count = 0;

for each training character {
    forward();

    /*
     * The number used as the LM loss.
     */
    train_nll += -logf(p[target] + 1e-12f);

    /*
     * Optional diagnostic only.
     */
    for (i = 0; i < Y_S; i++) {
        train_entropy -= p[i] * logf(p[i] + 1e-12f);
    }

    train_count++;
}

mean_train_nll = train_nll / (float)train_count;
mean_entropy   = train_entropy / (float)train_count;
train_ppl      = expf(mean_train_nll);

And then do the same thing on text not used for weight updates:

/*
 * Evaluation pass.
 *
 * No weight update.
 * No gradient accumulation needed.
 * Just forward pass + target NLL.
 */
eval_nll = 0.0f;
eval_count = 0;

for each held_out character {
    forward();

    eval_nll += -logf(p[target] + 1e-12f);
    eval_count++;
}

mean_eval_nll = eval_nll / (float)eval_count;
eval_ppl      = expf(mean_eval_nll);

Only after that would I trust the curve enough to ask whether the bottleneck is capacity.

If true held-out NLL is improving and then flattening, then yes, hidden size, gated cells, more data, better corpus separation, or more layers become real questions.

If the printed number is entropy, then the model might be improving or failing and the number may not tell you which.

The next wall after measurement is probably gradient scale. Even when the equations are right, vanilla RNN training can be touchy. I would want to know at least whether the gradient norm is exploding or getting clipped:

/*
 * Separate diagnostic:
 *
 * Are the gradients huge?
 * Are they tiny?
 * Is the LR fighting the recurrence?
 */
grad_norm = sqrtf(sum_of_squared_gradients);

if (grad_norm > clip_norm) {
    scale = clip_norm / (grad_norm + 1e-12f);

    /*
     * Multiply every gradient buffer by scale.
     */
    scale_all_gradients(scale);
}

So my order of suspicion would be:

Priority Check Why
1 Is the printed loss target NLL? otherwise the main number may not measure prediction
2 Is there held-out NLL/PPL? otherwise “stuck” is hard to interpret
3 Is softmax/log loss numerically stable? expf and logf can lie quietly
4 Are gradients clipped or at least measured? vanilla RNNs are sensitive here
5 Is LR/optimizer state stable? manual LR changes can hide other issues
6 Is the hidden state boundary correct? BPTT length and reset/carry choices matter
7 Only then: 256 vs 512 / layers / gated cell capacity is meaningful after the instruments are trustworthy

On the embedding question: for a 96-character one-hot input, I would not treat “no embedding layer” as the first-order problem. A matrix multiply by a one-hot vector is already acting like selecting one column/row of learned input weights. An explicit embedding layer may be cleaner or more efficient, but I would not expect it to magically fix a bad loss/reporting path.

So my short version would be:

/*
 * My guess:
 *
 * recurrent part: probably alive
 * output gradient: probably close now
 * printed loss: check very carefully
 * architecture: not the first suspect yet
 */

If the reported 0.94 is true target NLL in nats/char, that is one kind of result.

If the reported 0.94 is output entropy, I would not use it yet to decide whether the model needs 512 nodes, another layer, or an embedding layer.

I would first make the instruments say exactly:

mean_train_nll
mean_eval_nll
train_ppl
eval_ppl
mean_output_entropy      /* optional diagnostic */
grad_norm               /* optional diagnostic */

Then the architecture question becomes much less murky.