A model review is not a monitoring dashboard check. It is a structured governance event with a defined process, specific evidence requirements, and one of three mandatory outputs. This playbook defines how to run one.

The AI Model Review Playbook: A Step-by-Step Process for Production Models | AI Governance

This post is part of the Governance Operating Model pillar.

Most organizations monitor their models. Fewer review them.

The difference is consequential. Monitoring tells you whether current metrics are within acceptable ranges. A review asks whether the model's original assumptions still hold, whether its scope has drifted, whether its contracts are still the right contracts, and whether the evidence supports continued operation at the current autonomy tier.

This playbook defines the process for a structured model review from trigger to output.

When to run a model review

Scheduled reviews: - High-materiality models (automated decisions with significant financial, legal, or customer impact): quarterly - Medium-materiality models: semi-annual - Low-materiality models: annual - All models: after any major retraining

Triggered reviews (unscheduled): - Any drift alert that crosses the defined threshold - Any incident involving the model - Any significant change in business context (new customer segment, new product, regulatory change) - Any material change in input distributions - Any autonomy tier upgrade proposal

Who runs the review

Reviewer: A function independent of the model team — model risk, an independent data science review board, or a designated governance committee.

Participants: The model team (provides technical evidence), operations (provides outcome data), risk/compliance (confirms regulatory alignment), product (provides business context).

Accountable executive: The designated risk owner in the RACI — not the engineering manager or product lead.

The most common governance mistake in model review is letting the model team review its own model. The team that built the model has an incentive to find it acceptable and a blind spot for the assumptions they made during development.

The review evidence package

Before the review meeting, the model team assembles the evidence package. This is the primary input to the review. The package must contain:

AI Model Review Playbook: A step-by-step process showing five stages — Evidence Assembly, Performance Assessment, Scope Validation, Contract Review, and Output Decision — with required artifacts at each stage

1. Performance metrics against the original contract - Current AUC, precision/recall, or domain-specific performance metrics on fresh holdout data — not training data, not the original validation set - Calibration analysis: are scores still reflecting actual probabilities? - Segment-level breakdown: is performance uniform across the segments the model is applied to, or is it stronger in some and weaker in others? - Comparison to the baseline set at deployment: has performance improved, degraded, or remained stable?

2. Population stability analysis - PSI (Population Stability Index) on key input features: has the distribution of inputs changed since the model was trained? - Any new patterns in the input population that the model was not trained on? - Volume trends: is the model being called for a significantly different volume or mix of cases than at deployment?

3. Outcome linkage - For models where outcome data is available (credit default, fraud confirmation, claims paid): what do actual outcomes look like for decisions made by this model? - If the model is a ranking or scoring model, does the ordering of scores correlate with observed outcomes?

4. Scope compliance - Is the model being applied to cases within its defined scope? - Any evidence of scope creep — the model being used for decision types it was not validated for?

5. Contract compliance - Is the model meeting all the thresholds defined in its performance contract? - Were any circuit breakers triggered since the last review? If so, what caused them and how were they resolved?

6. Incident summary - Any incidents attributable to the model since the last review, their root causes, and the remediation actions taken

The five-step review process

Step 1: Evidence assessment (30 minutes)

Reviewers read the evidence package before the meeting. The meeting begins with any clarifying questions on the evidence — not with a summary of the evidence by the model team.

The first question: is the evidence package complete? If required data is missing — particularly fresh holdout performance or outcome linkage — the review cannot proceed until it is provided.


Step 2: Performance decision (20 minutes)

The reviewers assess each metric against its contract threshold.

For each metric: - Green: Meets or exceeds the contract threshold - Amber: Below threshold but within a defined acceptable tolerance; must be documented and monitored - Red: Breaches the contract threshold; requires a remediation decision

Any Red metric requires the review to address it explicitly in the output. It cannot be noted and deferred.


Step 3: Scope validation (15 minutes)

The reviewers assess whether the model is being used for the cases it was designed for.

Questions: - Have any new customer segments, products, or channels been added to the scope since the last review? - Has the input distribution shifted enough that the model's training population is no longer representative? - Is the model being used for decisions that were not explicitly included in its original validation?

Scope drift is the most common unreviewed risk in production models. A model validated for prime credit applicants being applied to thin-file applicants is not a model configuration error — it is a scope failure.


Step 4: Contract review (15 minutes)

The reviewers ask whether the existing contract thresholds are still the right thresholds.

This is the step most reviews skip. The original contract was written based on the business context and risk appetite at deployment. Both may have changed.

Questions: - Have regulatory requirements changed in a way that affects the performance bar? - Has the business risk appetite changed in a way that should raise or lower the acceptable error rate? - Have the decisions made by this model increased or decreased in materiality since the original contract was set?

If the thresholds need to change, that is a contract update — which requires the same sign-off process as the original contract.


Step 5: Output decision (10 minutes)

The review produces one of three mandatory outputs. No other outputs are acceptable.

✅ Approved: The model meets its performance contract, is applied within its defined scope, and is authorized to continue operating at its current autonomy tier.

⚠️ Conditional approval: The model has one or more Amber or Red findings, but continued operation is acceptable under defined conditions: specific remediation actions, a defined timeline, and a follow-up review date. Conditional approvals expire at the follow-up date — if remediation is not complete, the model defaults to suspension.

🛑 Suspended pending remediation: The model's performance or scope findings are severe enough that continued operation at the current tier is not authorized. The model must move to a lower tier or be taken offline until remediation is complete.

The output is documented in the governance log, communicated to all RACI stakeholders, and filed in the model inventory.

What a review is not

A review is not a dashboard check. Looking at monitoring dashboards and confirming metrics are green is monitoring. It is not a review. A review asks whether the metrics themselves are still the right metrics.

A review is not a retraining trigger conversation. Whether to retrain the model is a separate decision. The review determines whether the model can continue operating. Retraining is one possible remediation.

A review is not optional when metrics look fine. Metrics looking fine in aggregate is consistent with significant problems at the segment level. Scheduled reviews run on schedule.

The output documentation

Every review produces a one-page summary:

Field Content
Review date
Model name and version
Reviewer(s)
Evidence package status Complete / Incomplete (specify)
Performance findings Green / Amber / Red per metric
Scope findings In scope / Scope drift detected (specify)
Contract changes None / Updated (specify)
Output decision Approved / Conditional / Suspended
Conditions (if applicable) Remediation actions, timeline, follow-up date
Next scheduled review
Approving executive

This document is filed in the model inventory. It is the evidence that governance is functioning — not just monitoring.


Download the Architecture of Proof Checklist

Ready to implement? Get the definitive checklist for building verifiable AI systems.

Zoomed image
Free Download

Downloading Resource

Enter your email to get instant access. No spam — only occasional updates from Architecture of Proof.

Success

Link Sent

Great! We've sent the download link to your email. Please check your inbox.