A Bidirectional LLM Firewall: Architecture, Failure Modes, and Evaluation Results

Over the past few months I have been building and evaluating a stateful, bidirectional security layer that sits between clients and LLM APIs and enforces defense in depth on both paths: inbound (human → LLM) and outbound (LLM → user).

This is not a prompt-template guardrail system.
It’s a full middleware with deterministic layers, semantic components, caching, and a formal threat model.

I’m sharing details here because many teams seem to be facing similar issues (prompt injection, tool abuse, hallucination safety), and I would appreciate peer feedback from engineers who operate LLMs in production.

1. Architecture Overview

Inbound (Human → LLM)

  • Normalization Layer (sketched after this list)

    • NFKC/Homoglyph normalization

    • Recursive Base64/URL decoding (max depth = 3)

    • Controls for zero-width characters and bidi overrides

  • PatternGate (Regex Hardening)

    • 40+ deterministic detectors across 13 attack families

    • Used as the “first-hit layer” for known jailbreak primitives

  • VectorGuard + CUSUM Drift Detector (sketched after this list)

    • Embedding-based anomaly scoring

    • Sequential CUSUM to detect oscillating attacks

    • Protects against payload variants that bypass regex

  • Kids Policy / Context Classifier

    • Optional mode

    • Classifies fiction vs. real-world risk domains

    • Used to block high-risk contexts even when phrased innocently
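To make the inbound layers concrete, here are two minimal sketches. They illustrate the ideas above rather than the production code: the homoglyph tables and detector wiring are omitted, and the CUSUM parameters (mean, k, h) are placeholders that would be fit on benign traffic.

```python
import base64
import binascii
import unicodedata
from urllib.parse import unquote

# Invisible and reordering code points: zero-width chars, BOM, bidi controls.
STRIP_CHARS = frozenset(
    "\u200b\u200c\u200d\ufeff"        # zero-width space/joiners, BOM
    "\u202a\u202b\u202c\u202d\u202e"  # bidi embeddings and overrides
    "\u2066\u2067\u2068\u2069"        # bidi isolates
)

def normalize(text: str, max_depth: int = 3) -> str:
    """NFKC-fold, strip invisible controls, then peel URL/Base64 layers."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in STRIP_CHARS)
    for _ in range(max_depth):
        decoded = unquote(text)  # URL-encoded layer
        if decoded == text:
            try:
                decoded = base64.b64decode(text, validate=True).decode("utf-8")
            except (binascii.Error, UnicodeDecodeError, ValueError):
                break  # no further encoding layer found
        text = decoded
    return text
```

The CUSUM component keeps a one-sided cumulative sum over the per-request anomaly scores, so a series of individually borderline prompts still trips an alarm:

```python
class CusumDriftDetector:
    """One-sided CUSUM: s_t = max(0, s_{t-1} + (x_t - mean - k)),
    alarm when s_t > h. Catches slow or oscillating attack campaigns
    whose individual scores stay below the per-request threshold."""

    def __init__(self, mean: float, k: float = 0.5, h: float = 5.0):
        self.mean, self.k, self.h = mean, k, h
        self.s = 0.0

    def update(self, score: float) -> bool:
        self.s = max(0.0, self.s + (score - self.mean - self.k))
        return self.s > self.h  # True = sustained upward drift detected
```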

Outbound (LLM → User)

  • Strict JSON Decoder (sketched after this list)

    • Rejects duplicate keys, unsafe structures, and parser differentials

    • Required for safe tool-calling / autonomous agents

  • ToolGuard

    • Detects and blocks attempts to trigger harmful tool calls

    • Works via pattern + semantic analysis

  • Truth Preservation Layer

    • Lightweight fact-checker against a canonical knowledge base

    • Flags high-risk hallucinations (medicine, security, chemistry)
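For illustration, the duplicate-key rejection at the heart of the Strict JSON Decoder can be built on json.loads's object_pairs_hook; this is a minimal sketch (the size cap is an illustrative extra, and the real layer also validates structure and types):

```python
import json

class DuplicateKeyError(ValueError):
    pass

def _no_duplicates(pairs):
    """Fail closed on duplicate keys instead of silently keeping the last
    value, which is where parser differentials between the firewall and
    the downstream tool runtime tend to hide."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise DuplicateKeyError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

def strict_loads(raw: str, max_bytes: int = 65_536):
    if len(raw.encode("utf-8")) > max_bytes:
        raise ValueError("payload exceeds size cap")
    return json.loads(raw, object_pairs_hook=_no_duplicates)
```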

2. Decision Cache (Exact / Semantic / Hybrid)

A key performance component is a hierarchical decision cache:

  • Exact mode = hash-based lookup

  • Semantic mode = embedding similarity + risk tolerance

  • Hybrid mode = exact first, semantic fallback

In real workloads this cuts evaluation latency by 40–80%, depending on prompt diversity.
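A minimal sketch of the hybrid mode (embed stands in for whatever embedding model you use, and the 0.95 similarity floor is an illustrative risk tolerance, not a recommendation):

```python
import hashlib
import numpy as np

class HybridDecisionCache:
    """Exact hash lookup first; embedding-similarity fallback second."""

    def __init__(self, embed, sim_threshold: float = 0.95):
        self.embed = embed                # callable: str -> np.ndarray
        self.sim_threshold = sim_threshold
        self.exact = {}                   # sha256(prompt) -> decision
        self.vectors = []                 # (unit vector, decision) pairs

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        hit = self.exact.get(self._key(prompt))        # exact mode
        if hit is not None or not self.vectors:
            return hit
        v = self.embed(prompt)
        v = v / np.linalg.norm(v)
        sim, decision = max(((float(u @ v), d) for u, d in self.vectors),
                            key=lambda t: t[0])        # semantic fallback
        return decision if sim >= self.sim_threshold else None

    def put(self, prompt: str, decision) -> None:
        self.exact[self._key(prompt)] = decision
        v = self.embed(prompt)
        self.vectors.append((v / np.linalg.norm(v), decision))
```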

3. Evaluation Results (Internal Suite)

I tested the firewall against a synthetic adversarial suite (BABEL, NEMESIS, ORPHEUS, CMD-INJ).
This suite covers ~50 structured jailbreak families.

Results:

  • 0 / 50 bypasses on the current build

  • ~20–25% false positive rate on the Kids Policy (work in progress)

  • P99 latency: < 200 ms per request

  • Memory footprint: ~1.3 GB (mostly due to embedding model)

Important note:
These results apply only to the internal suite.
They do not imply general robustness, and I’m looking for external red-teaming.

4. Failure Modes Identified

The most problematic real-world cases so far:

  • Unicode abuse beyond standard homoglyph sets

  • “Role delegation” attacks that look benign until tool-level execution

  • Fictional prompts that drift into real harmful operational space

  • LLM hallucinations that fabricate APIs, functions, or credentials

  • Semantic near-misses where regex detectors fail but semantics are ambiguous

These informed several redesigns (especially the outbound layers).

5. Open Questions (Where I’d Appreciate Feedback)

  1. Best practices for low-FPR context classifiers in safety-critical tasks

  2. Efficient ways to detect tool-abuse intent when the LLM generates partial code

  3. Open-source adversarial suites larger than my internal one

  4. Integration patterns with LangChain / vLLM / FastAPI that don’t add excessive overhead

  5. Your experience with caching trade-offs under high-variability prompt workloads

If you operate LLMs in production or have built guardrails beyond templates, I’d appreciate your perspectives.
Happy to share more details or design choices on request.


I gathered some resources for now.

Wow, thank you! I'll check your information package ASAP!


Hello again,

I wanted to extend my sincere thanks for your incredibly detailed and actionable advice on LLM firewall architecture. Your guidance on moving beyond simple pattern matching toward a multi-layered, context-aware system has been invaluable.

We’ve directly applied several of your recommendations with measurable success:

  • Integrating Aho‑Corasick for efficient multi‑keyword matching in our SafetyValidator (see the sketch after this list).
  • Replacing binary risk scores with a nuanced, weighted scoring system that aggregates signals across layers.
  • Using HarmBench’s categorized metrics to drive our prioritization, which revealed our current weak points.
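For reference, the Aho‑Corasick step in a minimal sketch, using the pyahocorasick package (the keywords and weights here are illustrative, not our production list):

```python
import ahocorasick  # pip install pyahocorasick

# Illustrative phrase -> weight map; the real list is much larger.
KEYWORDS = {
    "ignore previous instructions": 0.9,
    "system prompt": 0.4,
}

automaton = ahocorasick.Automaton()
for phrase, weight in KEYWORDS.items():
    automaton.add_word(phrase, (phrase, weight))
automaton.make_automaton()

def keyword_risk(text: str) -> float:
    """One pass over the input, aggregating the weights of all matches."""
    score = 0.0
    for _end, (phrase, weight) in automaton.iter(text.lower()):
        score += weight
    return min(score, 1.0)
```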

As a result, our overall HarmBench ASR dropped to 18.0%, with copyright violations now at only 4.0% ASR.

We are now facing the next architectural decision, one where your system‑level perspective would be extremely helpful. Your original note recommended specialized detectors (e.g., for code‑intent or persuasive rhetoric) for “hard” cases like cybercrime/intrusion and misinformation.

Our key question is about the integration pattern for such detectors:
In a production firewall that must balance latency, maintainability, and safety, would you recommend implementing these specialized detectors as internal layers within the core firewall engine, or as separate, asynchronously called microservices?

We are especially concerned about:

  1. Latency impact of model inference (e.g., a CodeBERT‑style classifier) on the synchronous request path.
  2. Lifecycle and versioning: how to update a dedicated detector without redeploying the entire firewall.
  3. Failure isolation: ensuring that a failing detector doesn’t break the entire safety pipeline.

Any high‑level guidance you could share on this architectural choice would help us invest our engineering effort in the right direction.

Thank you again for your time and for sharing your expertise. It has already made a substantial difference in my project.

Great. I gathered some additional information. Hope it helps…

Hello again,

Following up on our previous discussion about integrating specialized detectors: We proceeded by embedding a custom convolutional neural network (CNN) for code-intent classification directly within the firewall process as a co-located library, avoiding the initial overhead of microservices.

Current Status: The detector operates in production shadow mode alongside the primary rule engine. After iterative adversarial training (focused on obfuscation and context-wrapping) and threshold optimization (θ=0.6), its performance on our defined evaluation suite shows:

  • 0% False Negative Rate for critical code/SQL injection payloads.

  • ~3% False Positive Rate on a security-focused benign subset.

  • <30ms added latency for inline inference.

The rule engine remains the final decision-maker, ensuring operational stability.
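For context, the shadow-mode wiring looks roughly like this (a simplified sketch; rule_engine and cnn_detector stand in for our actual components):

```python
import logging

log = logging.getLogger("firewall.shadow")

def evaluate(prompt: str, rule_engine, cnn_detector, theta: float = 0.6) -> bool:
    """The rule engine's verdict is enforced; the co-located CNN only
    logs, so disagreements can be reviewed before the detector is
    promoted to an enforcing layer."""
    enforced = rule_engine.blocks(prompt)           # final decision-maker
    shadow = cnn_detector.score(prompt) >= theta    # observed only
    if shadow != enforced:
        log.info("shadow disagreement: rule=%s cnn=%s", enforced, shadow)
    return enforced
```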

This internal hybrid pattern validated the core concept for our first detector. We are now planning to scale the architecture to incorporate additional specialized detectors (e.g., for persuasion, misinformation).

Based on your experience evolving such a system:

  1. Orchestration Pattern: For a multi-detector system, did you find a hierarchical router (dispatching to specific detectors) or a sequential pipeline (where all relevant detectors evaluate the prompt) to be more maintainable and performant in production? (Both patterns are sketched below.)

  2. Continual Learning: For detectors that must adapt to new tactics, what has been a reliable operational pattern to retrain and safely deploy updated models without causing service disruption or regression in core safety metrics?
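For concreteness, this is how we understand the two candidate patterns (interfaces illustrative):

```python
# Sequential pipeline: every detector evaluates every prompt.
def pipeline_blocks(prompt, detectors):
    return any(d.blocks(prompt) for d in detectors)

# Hierarchical router: a cheap classifier dispatches to the relevant
# specialist(s), so most prompts pay for only one model inference.
def routed_blocks(prompt, router, specialists):
    labels = router.route(prompt)  # e.g. {"code-intent"}
    return any(specialists[label].blocks(prompt) for label in labels)
```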

Your insights on scaling this architecture would be invaluable.


Follow-up Questions - Multi-Detector Architecture

Hello again,

Following up on our previous discussion about integrating specialized detectors: We proceeded by embedding a custom convolutional neural network (CNN) for code-intent classification directly within the firewall process as a co-located library, avoiding the initial overhead of microservices.

Current Status: The detector operates in production shadow mode alongside the primary rule engine. After iterative adversarial training (focused on obfuscation and context-wrapping) and threshold optimization (θ=0.6), its performance on our defined evaluation suite shows:

- 0% False Positive Rate (0/1000 benign samples across 9 categories)

- 95.7% Attack Detection Rate (557/582 adversarial samples)

  • Mathematical notation camouflage: 100% blocked (300/300)

  • Multilingual code-switching: 91.1% blocked (257/282, 25 bypasses)

- <30ms added latency for inline inference

The rule engine remains the final decision-maker, ensuring operational stability.

This internal hybrid pattern validated the core concept for our first detector. We are now planning to scale the architecture to incorporate additional specialized detectors (e.g., for persuasion, misinformation).

Based on your experience evolving such a system:

**Orchestration Pattern:** For a multi-detector system, did you find a hierarchical router (dispatching to specific detectors) or a sequential pipeline (where all relevant detectors evaluate the prompt) to be more maintainable and performant in production?

**Continual Learning:** For detectors that must adapt to new tactics, what has been a reliable operational pattern to retrain and safely deploy updated models without causing service disruption or regression in core safety metrics?

**Critical Follow-up Questions:**

**1. Shadow Mode to Production Transition:**

We’re currently operating in shadow mode with the rule engine as fallback. What has been your experience transitioning detectors from shadow mode to active production? Are there specific metrics thresholds (e.g., FPR <1%, FNR <5%) or validation periods (e.g., 2-4 weeks) that you found reliable before making the switch? How do you handle the transition without disrupting existing safety guarantees?

**2. Handling Known Bypasses:**

We have 25 multilingual attacks bypassing detection (8.9% of multilingual test suite) due to code embedded in string literals/comments that get filtered by preprocessing. Should we address these before production deployment, or is it acceptable to deploy with known limitations if they’re well-documented and monitored? What’s your threshold for “acceptable risk” when deploying security systems?

**3. Production FPR/FNR Monitoring:**

What monitoring infrastructure have you found most effective for tracking FPR/FNR in production? Do you use automated sampling, manual review queues, or a combination? How do you distinguish between legitimate false positives (user complaints) and actual system degradation? Any tools or frameworks you’d recommend?

**4. Sequential Pipeline at Scale:**

If we start with a sequential pipeline for 2-3 detectors, at what point does latency become a bottleneck? Have you found a practical limit (e.g., 3-4 detectors, 100ms total) before needing to switch to a router pattern? What were the key indicators that triggered your transition?

**5. Retraining Workflow:**

For establishing a retraining workflow, what’s your recommended validation process? We’re thinking: automated test suite (1,000+ samples), shadow mode deployment, regression testing (FPR/FNR thresholds), then gradual rollout. Is this reasonable, or are there critical steps we’re missing? How do you handle model versioning and rollback?

**6. Real-World Validation:**

Our test corpus is programmatically generated. How critical is it to validate with real-world production queries before scaling? Should we deploy the first detector to production first to collect real data, or can we proceed with synthetic test suites for additional detectors?

**7. Co-location Limits:**

With our current co-location approach adding <30ms per detector, how many detectors have you successfully co-located before hitting memory or latency constraints? At what point did you need to consider microservices or other architectural changes?

Your insights on these practical scaling challenges would be invaluable as we move toward a multi-detector system. TY :slight_smile:


I generated the continuation.

STATUS: Hybrid system with parallel execution of Code-Intent CNN (100% accuracy) and Content-Safety Transformer (100% accuracy). Rule engine final decision layer. Overall attack detection: 100% on core test set (101/101). False positive rate: 0% (0/1000 benign samples). Latency: <35ms for two parallel detectors.

Fixed 25 multilingual bypasses via preprocessing improvements. Identified new attack vector: poetic/metaphorical attacks (current detection: 83%, 20/24). Online learning active with 92 feedback samples. Conservative OR-logic: one detector blocks = overall block.
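For clarity, the conservative OR-logic with parallel execution in a minimal sketch (the detector interface is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def blocks(prompt: str, detectors) -> bool:
    """Run all detectors in parallel; one block verdict blocks the request."""
    if not detectors:
        return False
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        return any(pool.map(lambda d: d(prompt), detectors))
```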

Next: shadow mode validation, router implementation for the third detector, poetic attack mitigation via metaphor detection patterns. Thank you for your valuable help! :slight_smile:
