What is a decision log in AI?

A decision log is a structured payload that captures the exact telemetry of a system decision, including inputs, rule triggers, model scores, and human overrides at a specific point in time.

How do you make an AI system replayable?

Replayability is achieved by storing immutable records of decision-time context and model versions, allowing a system to recreate past outcomes for audit and forensics.

AI Audit Trails: Replayable AI for High-Stakes Systems | AI Governance

This post is part of the Decision Traceability pillar.

This article explains how to build AI audit trails and decision logs so you can replay decisions and prove what your system knew at decision time—capturing the full decision-time context: inputs, rule firing, model versions, and human overrides needed to pass regulatory and internal audits.

How do you design an AI system for replayability?

Replayable AI is achieved by capturing all decision-time context—inputs, model versions, and rule states—allowing any past outcome to be perfectly reconstructed and audited.

Explainable AI is about making a model's behavior understandable in the moment. It answers questions like "Which features mattered most?" or "Why did this case get a higher score?".

Replayable AI is about making a system's behavior reconstructable after the fact. It answers: "What inputs did the system actually see?", "Which rules fired?", and "Did any human override the result?".

Replayability is what lets you take a past decision, feed the same inputs and models back through, and get the exact same result. That’s the foundation for audits and real accountability.

The End-to-End Decision Timeline

To understand how these components interact at runtime, we can visualize the Decision Assembler Flow. This is the core logic engine that orchestrates deterministic rules, probabilistic models, and human oversight into a single immutable record.

The Decision Assembler Flow: A high-fidelity diagram showing the 4-stage journey from Data Ingestion to Final Outcome, featuring the central Decision Record Builder and the record-based replay loop.

⏱ THE END-TO-END DECISION TIMELINE

1. Data Ingestion

➜

Raw inputs anchored with cryptographic "Audit Anchor".

2. Parallel Processing

➜

Rules ensure compliance while Model provides insight.

3. Decision Assembler

➜

Rules, Model, and Control Tier merged into Record Builder.

4. Final Action

➜

System auto-executes or triggers Human Escalation.

5. Replay Loop

➜

Outcome linked to Anchor for forensic reconstruction.

The Audit Note: Every stage of the timeline should be cryptographically linked to the previous one, forming a "Chain of Proof" that makes the entire decision reconstructable in an isolated test environment.

Decision Log Record Anatomy

A mature audit trail is built on "Decision Records"—structured payloads that capture the assembly of a decision at the boundary.

Decision Log Record Anatomy: A diagram showing the internal structure of an AI decision record, including Header, Inputs, Rules, and Model blocks

Header: Decision ID, timestamp, system name.
Input Block: Key non-sensitive fields stored directly, plus a SHA-256 hash of the full payload and a reference ID pointing to a separate payload store.
Rules Block: List of rules with true/false results.
Model Block: Model name, version, and raw score.
Control Tier: The permissions active at that millisecond.
Human Block: Reviewer ID and override reason codes.

On Content Hashes: A hash is a one-way tamper seal, not a data replacement. You cannot reverse a hash back to the original data. The correct pattern is a two-part architecture:

The Audit Record stores key fields + a SHA-256 hash + a payload_ref_id.

A separate Payload Store (e.g., S3 or a secured DB) holds the full raw inputs, indexed by payload_ref_id.

During debugging, you fetch the raw payload from the Payload Store using the payload_ref_id, then re-hash it and compare against the stored hash. If they match, the data is provably unaltered. If they don't, you have detected tampering.

Decision Record Framework

Component	Content	Purpose
Header	Decision ID, Timestamp, Logic Version	Unique identification and anchoring in time.
Input Block	Key fields (direct) + SHA-256 + `payload_ref_id`	Direct fields for fast querying; hash + ref for full-data retrieval and tamper detection.
Rules Block	Rule Name, Result (Pass/Fail)	Verifying which deterministic constraints were met.
Model Block	Model ID, Version, Raw Score, Threshold	Documenting the probabilistic insight delivered.
Control Tier	Tier (0-3), Active Permissions	Proving the level of autonomy authorized at execution.
Human Block	Reviewer ID, Override Reason Code	Capturing accountability for any manual deviations.

Why Key Fields + SHA-256 + Ref ID?

Each of the three components serves a completely different purpose. Here is a concrete lending example:

{
  "decision_id": "dr-20240315-A7821",
  "timestamp": "2024-03-15T14:23:01Z",

  "loan_amount": 15000,
  "credit_score_bucket": "650-700",
  "region": "TX",
  "decision": "DECLINED",
  "control_tier": 2,

  "payload_sha256": "a3f8b2c1d4e5...",
  "payload_ref_id": "s3://audit-payloads/2024/03/A7821.json.gz"
}

Key fields — for fast, direct querying without ever touching the payload store. > "Show all TX applications with credit score 650–700 that were declined in March" > → Query runs on the audit DB alone. No S3 round-trips. Milliseconds.
SHA-256 hash — for tamper detection during a formal audit. > "Prove this record hasn't been altered since it was written" > → Fetch the raw payload via payload_ref_id, re-hash it, compare to payload_sha256. Match = untouched. No match = evidence of tampering.
Ref ID — for full reconstruction during regulatory inquiry or model retraining. > "What exact inputs did the model see for application A7821?" > → Fetch s3://audit-payloads/2024/03/A7821.json.gz and get every feature, field, and raw value.

You only go to the payload store for edge cases — regulatory review, model retraining, or a tamper investigation. For everything else, the key fields are sufficient, which is why costs stay low.

The Difference Between Infra Logs and Decision Logs

While infrastructure logs track system health and performance, decision logs track the logic and sanity of AI-driven outcomes for governance and compliance.

Most teams have great infrastructure logs (latency, CPU, request IDs). These tell you if the system was healthy. Decision logs tell you if the system was sane.

Technical Stack Comparison: Decision Logs (Governance) sit above Infrastructure Logs (SRE / Ops), showing the two separate worlds of AI governance and system health monitoring

Infra logs live in the SRE world. Decision logs live in the Governance world. You need both to prove that an AI system is operating within its intended guardrails.

Designing for Replayability

Designing for replayability requires a central Decision Log Store and Model Registry that can feed a Replay Engine with the exact state required to reproduce a past decision.

To reconstruct history, you have to design for it up front. This requires a specific architecture where your Decision Log Store and Model Registry can perfectly feed a Replay Engine.

AI Replay Architecture: Block diagram showing how a Replay Engine uses a Model Registry and Decision Log Store to recreate past outcomes

If you can't recreate a past decision exactly in a test environment, your audit trail has a gap.

Making Audit Access a Product

Finally, your audit trails shouldn't be hidden in a database that requires an engineer to query. They should be accessible via a Query Flow designed for risk and compliance users.

AI Audit Trail Query Flow: User interface diagram showing the path from a compliance console to individual decision records

This allows non-technical stakeholders to ask questions like "Show all Tier 2 decisions for segment X" or "Show defaults where rules said 'no' but the system approved."

Storage Planning for Audit Records

A well-designed audit trail is small enough to be affordable and structured enough to be queryable — but only if you plan for the two-layer architecture from day one.

Audit storage lives in two distinct layers with very different scaling profiles.

Layer 1: Decision Records

The audit record itself is lean. A compact JSON payload for a single decision is approximately:

Component	Estimated Size
Header (Decision ID, timestamp, logic version)	~100 bytes
Input Block (key fields + SHA-256 hash + ref ID)	~300 bytes
Rules Block (10–20 rule names + boolean results)	~500 bytes
Model Block (model ID, version, score, threshold)	~200 bytes
Control Tier + Human Block	~150 bytes
Total per record	~1–2 KB

Annualized storage at different decision volumes:

System Scale	Daily Volume	Yearly (raw)	Yearly (compressed)
Small team / internal tool	1K decisions	~700 MB	~70–140 MB
Mid-size product	100K decisions	~70 GB	~7–14 GB
High-frequency (e.g., lending)	1M decisions	~730 GB	~75–150 GB

Note on compression: Columnar formats like Parquet achieve 5–10x compression on structured audit records, especially on repeated string values like rule names, model IDs, and control tier codes.

Layer 2: Payload Store (Full Raw Inputs)

This is where storage scales fast. Full ML feature vectors or document payloads typically run 20–100 KB per decision:

System Scale	Avg Payload Size	Yearly
100K decisions/day	50 KB	~1.8 TB/year
1M decisions/day	50 KB	~18 TB/year

Design choice: If you don't require full replay — only tamper detection — you can store only the hash and skip the payload entirely. This drops Layer 2 cost to zero at the expense of not being able to reconstruct the exact original input.

Tiered Retention Strategy

Hot   (0–30 days)   → Postgres / DynamoDB     Fast queries for live compliance review
Warm  (30d–1yr)     → S3 + Athena / BigQuery  Cheap, still queryable for regulatory requests
Cold  (1yr+)        → S3 Glacier / Deep Archive  Pennies/GB for long-term legal retention

The decision record layer is affordable at any scale. The payload store is where you need a deliberate tiering policy. For most regulated industries, a 3-year warm + 7-year cold retention window covers audit obligations.

Practical Tips for Your Audit Stack

✓ AUDIT STACK BEST PRACTICES

● Log at the decision boundary (capture the final assembled picture).

● Keep payloads lean (store hashes for large objects elsewhere).

● Separate PII from mechanics (access-control personal data).

● Test replay in drills (periodically verify exact reconstruction).

A mature AI system should be able to show its work, prove consistency, and support accountability. This level of detail is exactly how audit trails enable effective AI escalation and provide the evidence needed for Composite Accountability.

Building Accountability frameworks for AI: See how audit trails anchor system-wide responsibility.
How AI Systems Ask for Help: Learn how audit logs provide context for automated escalation.

Download the Architecture of Proof Checklist

Ready to implement? Get the definitive checklist for building verifiable AI systems.