ARCUS-H: Full Evaluation Results — 979,200 Episodes, 51 RL Policies

We completed a large behavioral stability evaluation of trained RL policies: 979,200 evaluation episodes across 51 policy configurations, 12 environments, 8 algorithms, and 8 structured stress schedules. Here are three findings that matter for deployment.

:backhand_index_pointing_right: Finding 1: Reward explains 5.7% of behavioral stability variance.
The primary correlation between ARCUS-H stability scores and normalized reward is r = +0.240 [0.111, 0.354], p = 1.1×10⁻⁴ (n = 255 policy-level observations, 2,550 seed-level). R² = 0.057.
94.3% of the variance in how a policy behaves under sensor noise, actuator failure, or reward corruption is not captured by its return in clean conditions. 87% of policies rank differently under ARCUS-H than under reward, with a mean rank shift of 74.4 positions.
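
For concreteness, here is a minimal sketch of this kind of correlation and rank-shift analysis. The `stability` and `reward` arrays below are synthetic placeholders, not ARCUS-H's actual output:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reward = rng.normal(size=255)                      # clean-condition normalized return
stability = 0.24 * reward + rng.normal(size=255)   # weakly correlated stability score

# Correlation and variance explained
r, p = stats.pearsonr(stability, reward)
print(f"r = {r:+.3f}, p = {p:.1e}, R^2 = {r**2:.3f}")

# Rank shift: how far each policy moves between the two leaderboards
reward_rank = stats.rankdata(-reward)
stability_rank = stats.rankdata(-stability)
shift = np.abs(reward_rank - stability_rank)
print(f"mean rank shift = {shift.mean():.1f} positions")
```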

:backhand_index_pointing_right: Finding 2: SAC’s entropy objective amplifies sensor fragility.
SAC’s collapse rate under observation noise is 92.5%. TD3’s is 61.0% under the identical stressor: same environments, same training budget, both off-policy actor-critic.
This was first observed in a pilot evaluation on 47 pairs (90.2% / 61.1%) and has now replicated across 51 pairs and 10 seeds. The mechanism is clear: SAC’s entropy maximization amplifies sensitivity to noisy observations, while TD3’s target policy smoothing provides implicit robustness.
If you are choosing between SAC and TD3 for a noisy real-world deployment, this matters. Return alone will not tell you.
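
To make the stressor concrete, here is a simplified sketch: a fixed-sigma Gaussian observation-noise wrapper around a Gymnasium env, evaluating loaded SB3 SAC and TD3 policies. This is not ARCUS-H's actual stress schedule (which is structured, not fixed-sigma); the model paths and sigma are placeholders:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC, TD3

class GaussianObsNoise(gym.ObservationWrapper):
    """Add i.i.d. Gaussian noise to every observation."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return (obs + np.random.normal(0.0, self.sigma, obs.shape)).astype(obs.dtype)

def mean_return(model, env, episodes=20):
    """Average undiscounted return over a fixed number of episodes."""
    totals = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        totals.append(total)
    return float(np.mean(totals))

env = GaussianObsNoise(gym.make("HalfCheetah-v4"), sigma=0.1)
for name, algo in [("SAC", SAC), ("TD3", TD3)]:
    model = algo.load(f"path/to/{name.lower()}_halfcheetah")  # placeholder path
    print(f"{name}: mean return under noise = {mean_return(model, env):.1f}")
```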

:backhand_index_pointing_right: Finding 3: CNN robustness is representation-dependent, not architecture-determined.
ALE/SpaceInvaders-v5 shows a 13% collapse rate under observation noise. ALE/Pong-v5 shows 42% under the identical stressor. Same CNN architecture. Same AtariPreprocessing + FrameStack wrapper.
The difference is learned representation structure. SpaceInvaders requires the CNN to develop distributed, compositional features. Pong can be solved with localized object tracking. Different task complexity produces different representation structure, which produces different robustness to pixel noise.

The implication for sim-to-real: you cannot infer a CNN policy’s sensor robustness from its architecture. You have to measure it.
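
A sketch of what measuring can look like: sweep pixel-noise magnitudes per environment and compare degradation curves. `load_policy_and_env` is a hypothetical helper that rebuilds each env with its training-time preprocessing (AtariPreprocessing + frame stacking) and loads the matching SB3 policy; the sigma values are illustrative:

```python
import numpy as np

def noise_sweep(model, make_env, sigmas, episodes=30):
    """Retained performance at each noise level, relative to the clean baseline."""
    def avg_return(env):
        totals = []
        for _ in range(episodes):
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, r, terminated, truncated, _ = env.step(action)
                total, done = total + r, terminated or truncated
            totals.append(total)
        return float(np.mean(totals))

    clean = avg_return(make_env(sigma=0.0))
    return {s: avg_return(make_env(sigma=s)) / clean for s in sigmas}

# Hypothetical `load_policy_and_env`: rebuilds the env with training-time
# preprocessing and exposes a `sigma` knob for uint8 pixel noise.
for env_id in ["ALE/Pong-v5", "ALE/SpaceInvaders-v5"]:
    model, make_env = load_policy_and_env(env_id)
    print(env_id, noise_sweep(model, make_env, sigmas=(8, 16, 32)))
```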

ARCUS-H is open source. No retraining required. Works with any SB3 policy.
Run it on your SB3 model:

```bash
git clone https://github.com/karimzn00/ARCUSH

python -m arcus.harness_rl.run_eval \
    --run_dir path/to/your/model \
    --env     HalfCheetah-v4 \
    --algo    td3 \
    --seeds   0-4 \
    --episodes 120 \
    --both

# Atari (add obs-normalize for stressor symmetry):
python -m arcus.harness_rl.run_eval \
    --env ALE/Pong-v5 --algo ppo \
    --seeds 0-4 --episodes 120 --both --obs_normalize
```

Code + more details: https://github.com/karimzn00/ARCUSH
