Update:
## System Architecture
Four microservices operating as independent processes:
- **Orchestrator Service** (Port 8001): Central routing and decision aggregation. Implements hexagonal architecture (Ports & Adapters) with domain layer (ports, value objects, business rules), application services, and infrastructure adapters.
- **Code Intent Service** (Port 8000): Malicious code execution detection
- **Persuasion Service** (Port 8002): Manipulation and misinformation detection
- **Content Safety Service** (Port 8003): Policy enforcement and content safety validation
**Architectural Note:** The overall system follows layered microservices design. The Orchestrator Service internally uses hexagonal architecture (domain ports, value objects like `RoutingDecision`, `DetectorConfig`, domain services like `SemanticGate`, `LiabilityDecider`, and infrastructure adapters like `JudgeEnsembleAdapter`).
### Request Flow
All requests enter through the Orchestrator Service (Port 8001), which implements a five-layer filtering pipeline:
1. **Perimeter Service** (Layer 1): Fast pattern matching (<1ms). Whitelist (8 patterns) and hard block (15 patterns) checks. If matched, immediate allow/block. Otherwise, proceeds to Layer 2.
2. **Judge-Ensemble** (Layer 2): Semantic analysis using three embedding models (all-MiniLM-L6-v2, intfloat/e5-large-v2, thenlper/gte-base). Computes cosine distances to reference vectors. Final distance is median of three model outputs. Median aggregation provides robustness against outliers but does not account for model uncertainty variance (weighted Bayesian fusion would be statistically more efficient but requires uncertainty quantification).
3. **Intent Verification** (Layer 3): Quantifies alignment with the “Creative Writing” intent contract. Low distance (< 0.85) signals valid intent, triggering Safety Valve logic, subject to Layer 4 override.
4. **Veto Mechanism** (Layer 4): Deterministic pattern check. Blocks requests matching critical threat patterns (SSRF, code execution, SQL injection, command injection) regardless of semantic gate decision.
5. **Detector Services** (Layer 5): Orchestrator invokes specialized detectors based on routing policies:
   - Code Intent Service (Port 8000): Multi-stage pipeline (normalization, 10 rule-based validators, optional CodeBERT inference)
   - Persuasion Service (Port 8002): Pattern matching for rhetorical patterns, authority claims, social proof
   - Content Safety Service (Port 8003): Pattern-based classification. Total patterns loaded: 139 (JAILBREAK: 13, CONTENT_SAFETY: 106, CYBERSECURITY: 10, ROLEPLAY: 3, TECHNICAL_ATTACK: 7). Enhanced patterns: holocaust_denial, enslavement, malware_extended, privilege_escalation, academic_dishonesty
Results aggregated using confidence-weighted fusion with OR-threshold: if any detector blocks (score ≥ 0.7) or provides hard evidence, request is blocked.
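The aggregation rule above can be sketched as follows. This is a minimal illustration, not the Orchestrator's actual code: `DetectorVerdict`, `aggregate`, and their field names are hypothetical, and the confidence-weighted mean is one plausible reading of "confidence-weighted fusion". Only the 0.7 block threshold and the OR semantics come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

BLOCK_THRESHOLD = 0.7  # block score stated in the text


@dataclass(frozen=True)
class DetectorVerdict:
    """Hypothetical container for one detector's output."""
    name: str
    score: float        # risk score in [0, 1]
    confidence: float   # detector confidence in [0, 1]
    hard_evidence: bool = False


def aggregate(verdicts: List[DetectorVerdict],
              threshold: float = BLOCK_THRESHOLD) -> Tuple[bool, float]:
    """Return (blocked, fused_score).

    fused_score is a confidence-weighted mean (assumed form of the fusion).
    blocked follows the OR rule: any single detector at or above the
    threshold, or any hard evidence, blocks the request.
    """
    if not verdicts:
        return False, 0.0
    total_conf = sum(v.confidence for v in verdicts)
    fused = (sum(v.score * v.confidence for v in verdicts) / total_conf
             if total_conf > 0 else 0.0)
    blocked = any(v.hard_evidence or v.score >= threshold for v in verdicts)
    return blocked, fused
```

Note that under the OR rule a single detector at 0.75 blocks the request even when the fused score stays well below 0.7; in this sketch the fused score is informational only.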
## Security Properties
### Adversarial Robustness
Ensemble approach provides defense against adversarial examples optimized for a single embedding model. Median aggregation requires simultaneous fooling of models with different architectures and training objectives. Robustness is heuristic, not certified: no formal verification of transfer attack resistance or gradient obfuscation properties.
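To make the median argument concrete, here is a small sketch assuming each model has already produced an embedding for the request and that a reference vector exists per model. The function names and the dictionary layout are assumptions for illustration; real embeddings would come from the three SentenceTransformers models rather than hand-written vectors.

```python
import math
from statistics import median
from typing import Dict, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; raises on zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def ensemble_distance(embeddings: Dict[str, Sequence[float]],
                      references: Dict[str, Sequence[float]]) -> float:
    """Median of per-model cosine distances (keys are model names)."""
    return median(
        cosine_distance(embeddings[name], references[name])
        for name in embeddings
    )
```

With three models, an adversarial input that drives one model's distance to an extreme leaves the median unchanged as long as the other two models agree, which is the heuristic robustness property described above.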
### Deterministic Safety
Veto mechanism provides hard guarantee: requests matching a critical threat pattern are always blocked, independent of semantic analysis results. Fail-safe design: veto overrides semantic approval when hard evidence present. Stateful analysis over the fused conversation context is intended to close the Time-of-Check to Time-of-Use (ToCToU) gap inherent in optimistic streaming architectures.
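A minimal sketch of the override ordering, assuming a small set of placeholder regular expressions. The patterns, the `final_decision` name, and the `route_to_detectors` outcome are hypothetical; the real veto list is not given in this document.

```python
import re
from typing import Optional

# Placeholder patterns for illustration only; not the production veto list.
VETO_PATTERNS = {
    "command_injection": re.compile(r";\s*rm\s+-rf\b"),
    "sql_injection": re.compile(r"(?i)\bunion\s+select\b"),
    "ssrf": re.compile(r"(?i)https?://169\.254\.169\.254"),
}


def veto(text: str) -> Optional[str]:
    """Return the name of the first matching critical pattern, if any."""
    for name, pattern in VETO_PATTERNS.items():
        if pattern.search(text):
            return name
    return None


def final_decision(text: str, semantic_allows: bool) -> str:
    """Deterministic veto is checked first and cannot be overridden."""
    if veto(text) is not None:
        return "block"
    # Assumed behavior: without semantic approval, continue to Layer 5.
    return "allow" if semantic_allows else "route_to_detectors"
```

The point of the ordering is that `semantic_allows=True` has no effect once a veto pattern matches.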
### Performance Optimization
Perimeter service filters 80% of requests at sub-millisecond latency, reducing computational load. Orchestrator focuses on ambiguous cases requiring deep analysis.
## Implementation Details
### Model Version Validation
Model weights validated at startup using SHA256 hashes. Mismatches logged but do not prevent startup (graceful degradation). Note: Fail-open behavior trades security for availability; fail-closed (service shutdown on mismatch) would provide stronger security guarantees.
**Implementation (2025-12-23):** Hash calculation supports both file-based and model-object-based validation:
- **Model Object Validation**: Direct hash calculation from loaded model parameters (SentenceTransformers)
- **File-based Validation**: Fallback to cache file hashing when model object not available
- **Judge-Ensemble Models**: All 3 models (all-MiniLM-L6-v2, intfloat/e5-large-v2, thenlper/gte-base) validated at startup
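The file-based fallback can be sketched with the standard library alone. The function names and the `fail_closed` switch are assumptions added here to contrast the fail-open behavior described above with the stricter alternative; hashing from a loaded model object is not shown.

```python
import hashlib
import logging

logger = logging.getLogger("model_validation")


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA256 so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_model(path: str, expected_sha256: str,
                   fail_closed: bool = False) -> bool:
    """Compare a weight file against its pinned hash.

    fail_closed=False mirrors the documented behavior (log and continue).
    fail_closed=True is the stricter alternative: refuse to start.
    """
    actual = sha256_file(path)
    if actual == expected_sha256:
        return True
    message = f"hash mismatch for {path}: expected {expected_sha256}, got {actual}"
    if fail_closed:
        raise RuntimeError(message)
    logger.warning(message)
    return False
```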
### Rate Limiting
Token bucket algorithm limits expensive semantic checks per user (10 checks/minute, burst: 20). Falls back to strict mode (pattern-based blocking only) when budget exhausted. Rate limits are empirically determined; user reputation-based differentiation not implemented. Exponential backoff after burst exhaustion not specified.
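A compact token bucket using the stated limits (10 checks/minute, burst of 20) might look like the sketch below. The class and function names are hypothetical, and the injectable `clock` is an assumption that keeps the behavior testable; the per-user bookkeeping is omitted.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills continuously, capped at the burst size."""

    def __init__(self, rate_per_minute: float = 10.0, burst: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_minute / 60.0   # tokens per second
        self._capacity = float(burst)
        self._tokens = float(burst)           # start with a full bucket
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def choose_mode(bucket: TokenBucket) -> str:
    """Semantic check if budget remains, else pattern-only strict mode."""
    return "semantic" if bucket.try_acquire() else "strict"
```

In practice one bucket would be kept per user; a fresh bucket allows 20 semantic checks back to back and then roughly one every six seconds.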
### Context Fusion
Multi-turn conversations handled via sliding window (window size: 4-5, TTL: 300s). Window size empirically determined, not formally optimized. Perimeter service operates on original messages (stateless). Orchestrator uses fused context (stateful) for security analysis.
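One way to picture the sliding window is a bounded deque whose entries also expire. This is a sketch under assumptions: the class name, the newline-joined fusion, and the choice of 5 as the window size (the upper end of the stated 4-5) are illustrative, not taken from the implementation.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class ContextWindow:
    """Keeps the most recent turns, dropping any older than the TTL."""

    def __init__(self, max_turns: int = 5, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._turns: Deque[Tuple[float, str]] = deque(maxlen=max_turns)
        self._ttl = ttl_seconds
        self._clock = clock

    def add(self, message: str) -> None:
        # deque(maxlen=...) silently evicts the oldest turn when full.
        self._turns.append((self._clock(), message))

    def fused(self) -> str:
        """Concatenate the still-live turns for stateful analysis."""
        now = self._clock()
        live = [msg for ts, msg in self._turns if now - ts <= self._ttl]
        return "\n".join(live)
```

The perimeter check would see only the latest message, while the Orchestrator would analyze the output of `fused()`.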
### Unicode Security and Trojan Source Protection
Unicode security checks implemented with TR39 compliance for confusable and mixed-script detection. NFKC normalization (TR15) applied as first step. Trojan Source (BIDI control abuse) detection explicitly scoped to code/execution contexts only: checks performed in `CODE_EXECUTE` and `TOOL_INPUT` contexts, skipped in `PLAIN_TEXT` contexts. Execution context auto-detected from code blocks, tool calls, and execution patterns. This prevents false positives in general text while maintaining security for code-bearing inputs.
**P0 Enhancement (2025-12-23):** Unicode TR39 Full Tables:
- **Expanded Script Properties**: Comprehensive TR39 script ranges including Armenian, Georgian, Ethiopic, Bengali, Gujarati, Tamil, Thai, Hangul, and more
- **Complete Confusable Tables**: Extended homoglyph mappings including zero-width characters, mathematical symbols, and script-specific lookalikes
- **Compliance Verification**: Automated tests verify TR39 compliance for NFKC normalization, script detection, confusable detection, and mixed-script detection
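The normalization step and the context scoping can be illustrated with Python's `unicodedata` module. The function names are hypothetical, and the set of BIDI controls below (the embedding, override, and isolate characters) is an assumption about what the check covers; confusable and mixed-script detection need the TR39 data tables and are not shown.

```python
import unicodedata

# Explicit BIDI embedding/override/isolate controls (assumed check set).
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e"   # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"         # LRI, RLI, FSI, PDI
)

# Contexts in which the Trojan Source check runs, per the text.
SCOPED_CONTEXTS = frozenset({"CODE_EXECUTE", "TOOL_INPUT"})


def normalize(text: str) -> str:
    """NFKC normalization (UAX #15), applied before other checks."""
    return unicodedata.normalize("NFKC", text)


def has_trojan_source(text: str, context: str) -> bool:
    """Flag BIDI controls, but only in code-bearing contexts."""
    if context not in SCOPED_CONTEXTS:
        return False
    return any(ch in BIDI_CONTROLS for ch in text)
```

NFKC folds compatibility forms such as fullwidth letters into their ASCII equivalents but does not strip BIDI controls, which is why the two steps are separate.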
### Surprise Metrics and Operational Triggers
Surprise metric (uncertainty signal) extracted from detector results and used operationally, not only for telemetry. Triggers activated based on surprise thresholds:
- **Extra Detectors** (surprise ≥ 0.7): Additional detectors invoked (code_intent, persuasion)
- **Stricter Mode** (surprise ≥ 0.8): Block threshold lowered (0.7 → 0.5), confidence requirement increased (0.9)
- **Human Review Sampling** (surprise ≥ 0.9): Request marked for human review with high priority (100% sampling rate)
Surprise extraction uses multiple fallback strategies from detector metadata. Operational triggers integrated into routing decision logic.
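The three triggers can be read as a cumulative mapping from the surprise value to routing adjustments. The sketch below assumes the triggers stack (a surprise of 0.9 also enables the two lower tiers); `RoutingAdjustments` and `apply_surprise_triggers` are hypothetical names, while the thresholds and values come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoutingAdjustments:
    """Hypothetical summary of what a surprise value changes."""
    extra_detectors: List[str] = field(default_factory=list)
    block_threshold: float = 0.7            # default block threshold
    min_confidence: Optional[float] = None  # no extra requirement by default
    human_review: bool = False


def apply_surprise_triggers(surprise: float) -> RoutingAdjustments:
    adjustments = RoutingAdjustments()
    if surprise >= 0.7:   # Extra Detectors
        adjustments.extra_detectors = ["code_intent", "persuasion"]
    if surprise >= 0.8:   # Stricter Mode
        adjustments.block_threshold = 0.5
        adjustments.min_confidence = 0.9
    if surprise >= 0.9:   # Human Review Sampling
        adjustments.human_review = True
    return adjustments
```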
**P2 Enhancement (2025-12-23):** Surprise Metrics Monitoring now includes:
- **Alerting System**: Automatic alerts on surprise spikes (threshold: 0.75, critical: 0.85)
- **Spike Detection**: Tracks surprise history and detects multiple high-surprise events in time windows (5-minute windows, 3+ events trigger alert)
- **Dynamic Detector Addition**: Extra detectors automatically added to routing decisions when surprise spikes detected
- **Statistics API**: Real-time surprise statistics for monitoring (count, average, max, recent high-count)
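Spike detection over a fixed window can be sketched with a deque of timestamps. The class name and the explicit `now` argument are assumptions; the 0.75 alert threshold, the 5-minute window, and the 3-event minimum are the values listed above.

```python
from collections import deque
from typing import Deque


class SurpriseSpikeDetector:
    """Alerts when enough high-surprise events fall inside one window."""

    def __init__(self, alert_threshold: float = 0.75,
                 window_seconds: float = 300.0, min_events: int = 3) -> None:
        self._threshold = alert_threshold
        self._window = window_seconds
        self._min_events = min_events
        self._events: Deque[float] = deque()  # timestamps of high-surprise events

    def observe(self, surprise: float, now: float) -> bool:
        """Record one observation; return True if a spike is active."""
        if surprise >= self._threshold:
            self._events.append(now)
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0] > self._window:
            self._events.popleft()
        return len(self._events) >= self._min_events
```

A `True` result is the point at which, per the list above, an alert would fire and extra detectors would be added to routing.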
### Service Communication
Orchestrator communicates with detector services via HTTP REST APIs. Asynchronous requests with configurable timeouts. Circuit breakers prevent cascading failures.
**P2 Enhancement (2025-12-23):** Circuit Breaker Retry Budgets (Envoy-Style):
- **Retry Budget Management**: Per-detector retry budgets prevent retry storms (max: 20 retries, refill: 2 retries/second)
- **Adaptive Routing**: Circuit breaker state influences routing decisions:
  - **OPEN**: Load shedding (skip non-required detectors), fallback detector selection for required detectors
  - **HALF_OPEN**: Only high-priority detectors (priority 1) used in test mode
- **Load Shedding**: Optional detectors automatically skipped when circuit breakers are open
- **Fallback Detection**: Automatic rerouting to alternative detectors when primary detector unavailable
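The adaptive routing rules can be sketched as a pure function from detector metadata and circuit states to a selection. The `Detector` fields, `select_detectors`, and the behavior for a required detector with no fallback (it is skipped here) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass(frozen=True)
class Detector:
    """Hypothetical routing metadata for one detector."""
    name: str
    priority: int
    required: bool
    fallback: Optional[str] = None


def select_detectors(detectors: List[Detector],
                     states: Dict[str, CircuitState]) -> List[str]:
    selected: List[str] = []
    for det in detectors:
        state = states.get(det.name, CircuitState.CLOSED)
        if state is CircuitState.CLOSED:
            selected.append(det.name)
        elif state is CircuitState.OPEN:
            # Load shedding: optional detectors are skipped outright;
            # required ones are rerouted to a fallback when one exists.
            if det.required and det.fallback is not None:
                selected.append(det.fallback)
        elif det.priority == 1:
            # HALF_OPEN: only priority-1 detectors receive probe traffic.
            selected.append(det.name)
    return selected
```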
### Benchmark Integration
**P0 Enhancement (2025-12-23):** AgentDojo/InjecAgent Benchmark Support:
- **Final Actions Tracking**: Distinguishes attacker-intended actions from legitimate actions in benchmark runs
- **Execution State Capture**: Complete execution artifacts including tool calls, tool outputs, and final decisions
- **Action-Complete Artifacts**: Full execution state captured for reproducibility and forensic analysis
- **Benchmark Runners**: Enhanced `agentdojo_runner.py` and `injecagent_runner.py` with final action extraction
- **Blast-Radius Tests**: Integration tests verify blocking of malicious instructions in RAG context and tool output
## Performance Characteristics
**Latency Distribution**: Bimodal. ~0.5ms for 80% of traffic (Perimeter), ~2.5s for the 20% that requires deep analysis. Resolving most traffic on the fast path mitigates Economic Denial of Sustainability (EDoS) attacks, since only ambiguous requests reach the expensive semantic pipeline. The bimodal distribution creates SLA challenges: P95 latency is dominated by the slow path (~2.5s), and P99 is not specified. Admission control is recommended for tail latency protection.
- **Perimeter Service**: <1ms for pattern-matched requests (O(n) complexity, not O(1); scales with input size)
- **Orchestrator (Deep Analysis)**: 2-5 seconds (includes ensemble inference and detector execution; variance not specified)
- **Detector Services**: 200-500ms per service (rule-based) or 1-2 seconds (with ML inference; cache hit/miss behavior not specified)
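The effect of the 80/20 split on summary statistics can be checked with a back-of-the-envelope model. The sketch idealizes each mode as a single point (0.5ms and 2500ms), which is an assumption; real latencies have spread within each mode, and the function names are illustrative.

```python
def mixture_mean(fast_ms: float = 0.5, slow_ms: float = 2500.0,
                 fast_share: float = 0.8) -> float:
    """Mean latency of a two-point mixture."""
    return fast_share * fast_ms + (1.0 - fast_share) * slow_ms


def mixture_percentile(q: float, fast_ms: float = 0.5,
                       slow_ms: float = 2500.0,
                       fast_share: float = 0.8) -> float:
    """Percentile of a two-point mixture: fast up to fast_share, slow after."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    return fast_ms if q <= fast_share else slow_ms
```

Under this idealization the mean is about 500ms even though the median is 0.5ms, and every percentile above the 80th sits on the slow path, which is why P95 reflects deep-analysis latency.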
## Evaluation Results
### HarmBench Evaluation (2025-12-23)
**Test Configuration:**
- Total Behaviors: 1824
- Total Prompts: 1824
- Evaluation Duration: 276.9s (6.6 req/s)
**Results:**
- **True Positive Rate (TPR):** 98.0% (1788/1824 blocked)
- **Attack Success Rate (ASR):** 2.0% (36/1824 allowed)
- **Wilson CI (95%):** TPR [97.3%, 98.6%], ASR [1.4%, 2.7%]
- **Average Risk Score:** 0.946
- **Average Latency:** 3607.5ms
**Status:** ASR < 10% - Production-grade protection
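The reported intervals can be reproduced with the standard Wilson score formula. The function below is a generic sketch (the name and the default z value for a 95% interval are choices made here), applied to the counts above.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int,
                    z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


tpr_low, tpr_high = wilson_interval(1788, 1824)   # blocked prompts
asr_low, asr_high = wilson_interval(36, 1824)     # allowed prompts
print(f"TPR: [{tpr_low:.1%}, {tpr_high:.1%}]  ASR: [{asr_low:.1%}, {asr_high:.1%}]")
```

Rounded to one decimal place, these bounds agree with the TPR and ASR intervals listed in the results.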
### Benign Request Evaluation (2025-12-23)
**Test Configuration:**
- Total Samples: 1000 benign requests
- Test Target: Orchestrator Service (Port 8001)
**Results:**
- **Allow Rate:** 97.0% (970/1000 allowed)
- **False Positive Rate (FPR):** 3.0% (30/1000 blocked)
- **Average Risk Score:** 0.047
- **Median Risk Score:** 0.000
- **Risk Score Range:** 0.000 - 0.950
**Characteristic False Positives:**
- All 30 false positives have Risk Score 0.950
- Common phrases: “Can you help me understand:”, “I’m curious:”, “Please explain in simple terms”
## Recent Updates (2025-12-23)
### P0 Features - Critical Security Enhancements
#### AgentDojo/InjecAgent Final Actions Tracking
Benchmark runners now track final actions to distinguish attacker-intended from legitimate actions:
- **Final Action Classification**: `attacker_intended`, `legitimate`, or `blocked` based on test results
- **Execution State Capture**: Complete execution artifacts including tool calls, tool outputs, and final decisions
- **Action-Complete Artifacts**: Full execution state captured for reproducibility and analysis
- **Integration**: Implemented in both `agentdojo_runner.py` and `injecagent_runner.py`
#### Blast-Radius Boundary Tests
Explicit integration tests for malicious instructions in retrieved context and tool output:
- **RAG Context Poisoning**: Tests verify blocking when malicious instructions present in retrieved context
- **Tool Output Poisoning**: Tests verify blocking when malicious instructions present in tool output
- **Payload Stripping**: InjecAgent-style payload stripping verification
- **Test Coverage**: Comprehensive boundary tests in `test_blast_radius_boundaries.py`
#### Unicode TR39 Full Tables
Complete TR39 compliance implementation:
- **Expanded Script Properties**: Comprehensive script ranges (Armenian, Georgian, Ethiopic, Bengali, Gujarati, Tamil, Thai, Hangul, etc.)
- **Complete Confusable Tables**: Extended homoglyph mappings including zero-width characters and mathematical symbols
- **Compliance Verification**: Automated tests in `test_tr39_compliance_verification.py`
#### LIABILITY_REQUIRED Logging/Redaction
Minimal logging with aggressive redaction for dual-use requests:
- **Sensitive Field Redaction**: User IDs hashed, prompt/response content replaced with hash placeholders
- **PII Protection**: IP addresses, email addresses, phone numbers, geolocation redacted
- **Integration**: `LiabilityLoggingRedactor` integrated into `LiabilityDecider` for automatic redaction
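As a rough illustration of the redaction rules, the sketch below hashes user IDs, replaces content with hash placeholders, and blanks PII fields. It is not the `LiabilityLoggingRedactor` implementation: the field names, the placeholder format, the 16-character digest truncation, and the salt handling are all assumptions.

```python
import hashlib
from typing import Any, Dict

# Hypothetical field names; the real log schema is not shown in this document.
CONTENT_FIELDS = frozenset({"prompt", "response"})
PII_FIELDS = frozenset({"ip_address", "email", "phone", "geolocation"})


def _digest(value: str, salt: bytes) -> str:
    """Salted SHA256, truncated for readability in logs."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()[:16]


def redact_record(record: Dict[str, Any],
                  salt: bytes = b"demo-salt") -> Dict[str, Any]:
    """Return a copy of a log record that is safe to persist."""
    redacted: Dict[str, Any] = {}
    for key, value in record.items():
        if key == "user_id":
            redacted[key] = _digest(str(value), salt)
        elif key in CONTENT_FIELDS:
            redacted[key] = f"<sha256:{_digest(str(value), salt)}>"
        elif key in PII_FIELDS:
            redacted[key] = "<redacted>"
        else:
            redacted[key] = value
    return redacted
```

Hashing rather than dropping the user ID and content keeps records correlatable across log lines without storing the raw values; a real deployment would need a secret, rotated salt rather than the demo constant.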
### P2 Features - Operational Enhancements
#### Circuit Breaker Retry Budgets (Envoy-Style)
Retry storm prevention with per-detector budgets:
- **Retry Budget Port**: `RetryBudgetPort` interface for budget management
- **Retry Budget Adapter**: Envoy-style implementation with configurable limits (max: 20, refill: 2/s)
- **Adaptive Routing**: Circuit breaker state influences detector selection:
  - **OPEN**: Load shedding (skip optional), fallback selection for required
  - **HALF_OPEN**: Only priority 1 detectors in test mode
- **Load Shedding**: Automatic skipping of non-critical detectors under load
#### NIST Mapping Versioning
Controlled interface for NIST AI 100-2e2025 mapping changes:
- **Version Control Process**: MAJOR.MINOR.PATCH versioning with documented workflow
- **Review-Required Workflow**: All changes require review and approval before deployment
- **Auditor Traceability Matrix**: Complete traceability from internal categories to NIST attack taxonomy
- **Documentation**: `NIST_MAPPING_VERSIONING_PROCESS.md` and `NIST_AUDITOR_TRACEABILITY.md`
#### Surprise Metrics Monitoring (Enhanced)
Operational alerting and dynamic detector addition:
- **Alerting System**: Automatic alerts on surprise spikes (threshold: 0.75, critical: 0.85)
- **Spike Detection**: Tracks surprise history and detects multiple high-surprise events in time windows
- **Dynamic Detector Addition**: Extra detectors automatically added to routing when surprise spikes detected
- **Statistics API**: Real-time surprise statistics for monitoring
### Trojan Source Explicit Scoping
BIDI control character detection (Trojan Source attacks) now explicitly scoped to execution-bearing contexts. Checks performed only in `CODE_EXECUTE` and `TOOL_INPUT` contexts, automatically skipped in `PLAIN_TEXT` contexts. Execution context auto-detected from code block patterns, tool call indicators, and execution function patterns. Prevents false positives in general text while maintaining security for code inputs.
### Unicode TR39 Compliance
Confusable and mixed-script detection now uses TR39-compliant data structures and rules. NFKC normalization (TR15) remains first processing step. Trojan Source checks remain context-scoped (code/execution only).
## Limitations
1. Ensemble requires GPU resources for real-time inference (3 models loaded simultaneously)
2. Pattern-based filtering may have false positives for edge cases (3.0% FPR observed)
3. Semantic analysis latency scales with request complexity (average 3.6s for HarmBench)
4. Surprise metric thresholds empirically determined; formal optimization not performed
5. Execution context auto-detection uses heuristics (code block patterns, function calls); may miss edge cases
6. Retry budget refill rates are fixed (2 retries/second); adaptive refill based on system load not implemented
7. NIST mapping versioning requires manual review process; automated compliance checking not implemented
8. Surprise spike detection uses fixed time windows (5 minutes); adaptive window sizing not implemented
## Architecture Compliance
### Hexagonal Architecture (Orchestrator Service)
The Orchestrator Service implements strict hexagonal architecture:
**Domain Layer:**
- **Ports**: `DetectorRouterPort`, `CircuitBreakerStatePort`, `RetryBudgetPort`, `JudgeEnsemblePort`, `SecurityAuditLoggerPort`, `SecurityMetricsPort`
- **Value Objects**: `RoutingDecision`, `DetectorConfig`, `DetectorResult`, `AggregatedResult`
- **Domain Services**: `SemanticGate`, `LiabilityDecider`, `AdvancedContextAnalyzer`
**Application Layer:**
- **Use Cases**: `IntelligentRouterService` (implements `DetectorRouterPort`)
- **Orchestration**: Request routing, detector execution, result aggregation
**Infrastructure Layer:**
- **Adapters**: `JudgeEnsembleAdapter`, `CircuitBreakerStateAdapter`, `RetryBudgetAdapter`, `PerimeterServiceAdapter`
- **External Services**: Circuit breakers, retry budgets, model validation
**Benefits:**
- Testability: Domain logic testable without infrastructure dependencies
- Flexibility: Easy adapter swapping (e.g., different retry budget implementations)
- Maintainability: Clear separation of concerns
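The testability and adapter-swapping benefits can be shown with a small port/adapter pair. Only the names `RetryBudgetPort` and `RetryBudgetAdapter` appear in this document; the `try_consume` method, the in-memory adapter, the deny-all test double, and `call_with_retries` are hypothetical and stand in for the real interfaces.

```python
import time
from typing import Callable, Dict, Protocol, Tuple, TypeVar

T = TypeVar("T")


class RetryBudgetPort(Protocol):
    """Domain-side port; the method name is an assumption."""
    def try_consume(self, detector: str) -> bool: ...


class InMemoryRetryBudgetAdapter:
    """Per-detector budget using the stated limits (max 20, refill 2/s)."""

    def __init__(self, max_retries: int = 20, refill_per_second: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = float(max_retries)
        self._refill = refill_per_second
        self._clock = clock
        self._state: Dict[str, Tuple[float, float]] = {}  # name -> (tokens, last)

    def try_consume(self, detector: str) -> bool:
        now = self._clock()
        tokens, last = self._state.get(detector, (self._max, now))
        tokens = min(self._max, tokens + (now - last) * self._refill)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._state[detector] = (tokens, now)
        return allowed


class DenyAllRetryBudget:
    """Test double: swapping adapters needs no change to domain code."""
    def try_consume(self, detector: str) -> bool:
        return False


def call_with_retries(budget: RetryBudgetPort, detector: str,
                      call: Callable[[], T], max_attempts: int = 3) -> T:
    """Application logic depends only on the port, not on any adapter."""
    last_error = None
    for attempt in range(max_attempts):
        # The first attempt is free; each retry must be paid from the budget.
        if attempt > 0 and not budget.try_consume(detector):
            break
        try:
            return call()
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError(f"{detector} unavailable") from last_error
```

Because `call_with_retries` only sees the protocol, a unit test can pass `DenyAllRetryBudget()` to exercise the exhausted-budget path without timers or network calls.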
### Compliance and Standards
**NIST AI 100-2e2025 Mapping (P2 - 2025-12-23):**
- **Version Control**: Controlled interface for mapping changes with MAJOR.MINOR.PATCH versioning
- **Review-Required Workflow**: All changes require review and approval before deployment
- **Auditor Traceability**: Complete traceability matrix from internal categories to NIST attack taxonomy
- **Documentation**: `NIST_MAPPING_VERSIONING_PROCESS.md` and `NIST_AUDITOR_TRACEABILITY.md`
- **Mapping Files**: `nist_ai100_2e2025_mapping.yaml` and `NIST_MAPPING.json` with version tracking
**Unicode TR39 Compliance (P0 - 2025-12-23):**
- **Complete Script Properties**: Comprehensive TR39 script ranges for security analysis
- **Full Confusable Tables**: Extended homoglyph mappings including zero-width characters
- **Compliance Verification**: Automated tests verify TR39 compliance
- **NFKC Normalization**: TR15 normalization as foundational step