CASE STUDY · FINANCIAL SERVICES · 6 MONTHS

Cutting model evaluation from weeks to a day.

A global Fortune 500 bank needed an evaluation harness that could satisfy model risk management (MRM) requirements at the pace the business actually moved. We designed and shipped one over six months. It now evaluates every model change inside the audit channel the bank already had.

What we ship into
Discover · Design · Wire · Evaluate · Ship · Hand back
14d → 1d
Evaluation cycle time
3
Internal data systems integrated
0
Outputs reproducible from logged inputs
6
Months from kickoff to production
The brief

A risk team boxed in by its own process.

The bank's risk-modeling group was working with three different LLM providers across half a dozen production use cases. Every change — vendor swap, model upgrade, prompt revision — required a full evaluation pass. The pass took two weeks. Most of the two weeks was spent reassembling fixtures, re-running comparisons by hand, and writing the audit memo.

The result: the team was either slow or unsafe. They took two weeks for every change, or they shipped without the eval pass and hoped the model risk function didn't notice. Neither was acceptable.

They came to us asking for a system that gave them the eval cadence they needed without compromising the audit trail.

What we built

An evaluation harness inside the audit channel.

  • 01

    A versioned fixture system

    Every input the system has ever seen is captured, tagged, and replayable. Each new eval run is reported as a diff against prior runs.

  • 02

    Three internal data systems integrated

    The harness pulls reference data directly from the bank's golden sources rather than copying it. Lineage is automatic.

  • 03

    Vendor LLM gateway integration

    A single layer fronts the three model providers. Vendor swaps become a configuration change, not a code change.

  • 04

    Audit-channel publishing

    Eval results flow directly into the bank's existing model-risk audit channel. The MRM team reviews the same artifacts they were already reviewing.

  • 05

    Drift and regression detection

    Scheduled comparisons run nightly. The team learns about regressions from the dashboard, not from a customer.

  • 06

    Operator runbook

    A written runbook shipped with the system. Six months later, the bank's own engineers run the harness without us.
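
As an illustration only, here is a minimal sketch of how items 01, 03, and 05 could fit together: fixtures identified by a content hash, a provider chosen by configuration, and a new run reported as a diff against a prior run. This is not the bank's implementation. Every name here (`PROVIDERS`, `Fixture`, `run_eval`, `diff_runs`) is hypothetical, and the two "vendors" are toy functions standing in for real model clients.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical provider registry. Real entries would wrap vendor clients;
# these toy functions only make the sketch runnable.
PROVIDERS: Dict[str, Callable[[str], str]] = {
    "vendor_a": lambda prompt: prompt.upper(),
    "vendor_b": lambda prompt: prompt.lower(),
}


@dataclass(frozen=True)
class Fixture:
    """One captured input, identified by a hash of its content."""
    prompt: str
    tags: tuple = ()

    @property
    def fixture_id(self) -> str:
        payload = json.dumps(
            {"prompt": self.prompt, "tags": list(self.tags)}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def run_eval(config: Dict[str, str], fixtures: List[Fixture]) -> Dict[str, str]:
    """Replay every fixture through the provider named in the config."""
    model = PROVIDERS[config["provider"]]
    return {f.fixture_id: model(f.prompt) for f in fixtures}


def diff_runs(prior: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Report fixture IDs whose output changed, appeared, or disappeared."""
    shared = prior.keys() & current.keys()
    return {
        "changed": sorted(k for k in shared if prior[k] != current[k]),
        "added": sorted(current.keys() - prior.keys()),
        "removed": sorted(prior.keys() - current.keys()),
    }


fixtures = [
    Fixture("Summarize the Q3 credit memo", ("summarization",)),
    Fixture("OK", ("smoke",)),
]

# A vendor swap is a one-line configuration change in this sketch.
baseline = run_eval({"provider": "vendor_a"}, fixtures)
candidate = run_eval({"provider": "vendor_b"}, fixtures)
report = diff_runs(baseline, candidate)
print(report)
```

In this sketch a scheduled job would call `run_eval` nightly and flag any non-empty `changed` list. A real harness would compare scored metrics rather than raw strings.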

The outcome

Faster cadence, stronger audit posture.

The eval cycle dropped from two weeks to one day. Vendor swaps that used to take a quarter are now done in a week.

More importantly, the audit posture is materially stronger. The MRM team has structured eval evidence for every model change going back to launch. The team that operates the system can answer a regulator's reproducibility question without preparation.
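
One way to picture that reproducibility check is an evidence record that stores the logged inputs with a digest of the output, so anyone can replay the inputs and compare. The sketch below assumes model responses are logged alongside the inputs, so a replay re-runs only the deterministic scoring step and never calls a provider. The record layout, `make_record`, `verify_record`, and `score` are all hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def digest(obj: Any) -> str:
    """Stable SHA-256 over a JSON-serializable object."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_record(change_id: str, inputs: Dict[str, Any], output: Any) -> Dict[str, Any]:
    """Build an evidence record: logged inputs plus a digest of the output."""
    return {"change_id": change_id, "inputs": inputs, "output_digest": digest(output)}


def verify_record(record: Dict[str, Any], replay: Callable[[Dict[str, Any]], Any]) -> bool:
    """Re-run the logged inputs and compare against the stored output digest."""
    return digest(replay(record["inputs"])) == record["output_digest"]


def score(inputs: Dict[str, Any]) -> Dict[str, bool]:
    """Hypothetical deterministic scorer standing in for an eval step."""
    return {"exact_match": inputs["expected"] == inputs["actual"]}


logged_inputs = {"expected": "approve", "actual": "approve"}
record = make_record("chg-001", logged_inputs, score(logged_inputs))
print(verify_record(record, score))
```

If the logged inputs or the scoring logic change after the fact, the recomputed digest no longer matches and `verify_record` returns `False`.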

Six months after launch, Proxiant transitioned the system fully to the bank's own engineers. We continue on a quarterly retainer for evaluation methodology and red-team support.

Engagement arc
Discover · Design · Wire · Evaluate · Ship · Hand back
