How often should AI models be reviewed in production?

High-materiality models (those making consequential automated decisions) should have a formal performance review at least quarterly, with monthly monitoring against defined thresholds. Any drift alert, incident, or significant business change should trigger an unscheduled review. Lower-materiality models should be reviewed at least annually.

Who should conduct an AI model review?

Model reviews should be conducted by a function independent of the team that built and operates the model — typically model risk, an independent data science function, or a designated review board. The team that built the model provides technical context but should not be the primary reviewer of its own work.

What does an AI model review produce?

A model review must produce one of three explicit outputs: approved for continued operation (with documented rationale), conditional approval with a remediation timeline (specifying what must be fixed and by when), or suspension pending remediation (immediate action required before the model can continue operating). 'We reviewed it and it looks fine' is not an acceptable output.

The AI Model Review Playbook: A Step-by-Step Process for Production Models | AI Governance

This post is part of the Governance Operating Model pillar.

Most organizations monitor their models. Fewer review them.

The difference is consequential. Monitoring tells you whether current metrics are within acceptable ranges. A review asks whether the model's original assumptions still hold, whether its scope has drifted, whether its contracts are still the right contracts, and whether the evidence supports continued operation at the current autonomy tier.

This playbook defines the process for a structured model review from trigger to output.

When to run a model review

Scheduled reviews: - High-materiality models (automated decisions with significant financial, legal, or customer impact): quarterly - Medium-materiality models: semi-annual - Low-materiality models: annual - All models: after any major retraining

Triggered reviews (unscheduled): - Any drift alert that crosses the defined threshold - Any incident involving the model - Any significant change in business context (new customer segment, new product, regulatory change) - Any material change in input distributions - Any autonomy tier upgrade proposal

Who runs the review

Reviewer: A function independent of the model team — model risk, an independent data science review board, or a designated governance committee.

Participants: The model team (provides technical evidence), operations (provides outcome data), risk/compliance (confirms regulatory alignment), product (provides business context).

Accountable executive: The designated risk owner in the RACI — not the engineering manager or product lead.

The most common governance mistake in model review is letting the model team review its own model. The team that built the model has an incentive to find it acceptable and a blind spot for the assumptions they made during development.

The review evidence package

Before the review meeting, the model team assembles the evidence package. This is the primary input to the review. The package must contain:

AI Model Review Playbook: A step-by-step process showing five stages — Evidence Assembly, Performance Assessment, Scope Validation, Contract Review, and Output Decision — with required artifacts at each stage

1. Performance metrics against the original contract - Current AUC, precision/recall, or domain-specific performance metrics on fresh holdout data — not training data, not the original validation set - Calibration analysis: are scores still reflecting actual probabilities? - Segment-level breakdown: is performance uniform across the segments the model is applied to, or is it stronger in some and weaker in others? - Comparison to the baseline set at deployment: has performance improved, degraded, or remained stable?

2. Population stability analysis - PSI (Population Stability Index) on key input features: has the distribution of inputs changed since the model was trained? - Any new patterns in the input population that the model was not trained on? - Volume trends: is the model being called for a significantly different volume or mix of cases than at deployment?

3. Outcome linkage - For models where outcome data is available (credit default, fraud confirmation, claims paid): what do actual outcomes look like for decisions made by this model? - If the model is a ranking or scoring model, does the ordering of scores correlate with observed outcomes?

4. Scope compliance - Is the model being applied to cases within its defined scope? - Any evidence of scope creep — the model being used for decision types it was not validated for?

5. Contract compliance - Is the model meeting all the thresholds defined in its performance contract? - Were any circuit breakers triggered since the last review? If so, what caused them and how were they resolved?

6. Incident summary - Any incidents attributable to the model since the last review, their root causes, and the remediation actions taken

The five-step review process

Step 1: Evidence assessment (30 minutes)

Reviewers read the evidence package before the meeting. The meeting begins with any clarifying questions on the evidence — not with a summary of the evidence by the model team.

The first question: is the evidence package complete? If required data is missing — particularly fresh holdout performance or outcome linkage — the review cannot proceed until it is provided.

Step 2: Performance decision (20 minutes)

The reviewers assess each metric against its contract threshold.

For each metric: - Green: Meets or exceeds the contract threshold - Amber: Below threshold but within a defined acceptable tolerance; must be documented and monitored - Red: Breaches the contract threshold; requires a remediation decision

Any Red metric requires the review to address it explicitly in the output. It cannot be noted and deferred.

Step 3: Scope validation (15 minutes)

The reviewers assess whether the model is being used for the cases it was designed for.

Questions: - Have any new customer segments, products, or channels been added to the scope since the last review? - Has the input distribution shifted enough that the model's training population is no longer representative? - Is the model being used for decisions that were not explicitly included in its original validation?

Scope drift is the most common unreviewed risk in production models. A model validated for prime credit applicants being applied to thin-file applicants is not a model configuration error — it is a scope failure.

Step 4: Contract review (15 minutes)

The reviewers ask whether the existing contract thresholds are still the right thresholds.

This is the step most reviews skip. The original contract was written based on the business context and risk appetite at deployment. Both may have changed.

Questions: - Have regulatory requirements changed in a way that affects the performance bar? - Has the business risk appetite changed in a way that should raise or lower the acceptable error rate? - Have the decisions made by this model increased or decreased in materiality since the original contract was set?

If the thresholds need to change, that is a contract update — which requires the same sign-off process as the original contract.

Step 5: Output decision (10 minutes)

The review produces one of three mandatory outputs. No other outputs are acceptable.

✅ Approved: The model meets its performance contract, is applied within its defined scope, and is authorized to continue operating at its current autonomy tier.

⚠️ Conditional approval: The model has one or more Amber or Red findings, but continued operation is acceptable under defined conditions: specific remediation actions, a defined timeline, and a follow-up review date. Conditional approvals expire at the follow-up date — if remediation is not complete, the model defaults to suspension.

🛑 Suspended pending remediation: The model's performance or scope findings are severe enough that continued operation at the current tier is not authorized. The model must move to a lower tier or be taken offline until remediation is complete.

The output is documented in the governance log, communicated to all RACI stakeholders, and filed in the model inventory.

What a review is not

A review is not a dashboard check. Looking at monitoring dashboards and confirming metrics are green is monitoring. It is not a review. A review asks whether the metrics themselves are still the right metrics.

A review is not a retraining trigger conversation. Whether to retrain the model is a separate decision. The review determines whether the model can continue operating. Retraining is one possible remediation.

A review is not optional when metrics look fine. Metrics looking fine in aggregate is consistent with significant problems at the segment level. Scheduled reviews run on schedule.

The output documentation

Every review produces a one-page summary:

Field	Content
Review date
Model name and version
Reviewer(s)
Evidence package status	Complete / Incomplete (specify)
Performance findings	Green / Amber / Red per metric
Scope findings	In scope / Scope drift detected (specify)
Contract changes	None / Updated (specify)
Output decision	Approved / Conditional / Suspended
Conditions (if applicable)	Remediation actions, timeline, follow-up date
Next scheduled review
Approving executive

This document is filed in the model inventory. It is the evidence that governance is functioning — not just monitoring.

Governance Operating Model: The full framework for AI governance operations.
AI Governance RACI: Who owns model review and how ownership is structured.
How to Write AI Component Contracts: The contracts that model reviews evaluate.

Download the Architecture of Proof Checklist

Ready to implement? Get the definitive checklist for building verifiable AI systems.