
AI Agent for AP Coding & Three-Way Matching Case Study

A $180M B2B services company deployed an AI agent that reads incoming invoices, suggests GL coding with confidence scores, performs three-way matching, and routes exceptions — all under a documented SOX-aware control framework with reviewer sign-off, evidence packs, and PCAOB-aligned governance.

Client profile: Composite case study based on a $180M B2B services company on NetSuite + Bill, ~3,800 invoices / month, ~3,200 active vendors, mix of PO and non-PO spend. PE-backed; SOX 404 readiness ahead of IPO; CFO mandate to deploy AI in finance with proper controls.

Company context

The client is a $180M B2B services company on NetSuite + Bill, processing ~3,800 invoices monthly across ~3,200 vendors. AP team of 6 was 80% transactional; CFO wanted to redirect that capacity to strategic AP work (vendor management, discount capture, working capital optimization). The CFO also wanted to demonstrate AI-in-finance maturity ahead of the IPO; investors expected it.

AI in AP has a control-design problem. The auditors' baseline question is: "What stops the agent from miscoding a $50K bill?" Answering it requires confidence thresholds, reviewer routing, evidence packs, change management, and PCAOB-aligned governance. Without those, the AI agent is an audit finding waiting to happen.

  • $180M B2B services
  • NetSuite GL + Bill (BILL.com)
  • ~3,800 invoices / month
  • ~3,200 active vendors
  • 6-person AP team
  • PE-backed; pre-IPO
  • SOX 404 readiness program in flight
  • CFO mandate: AI in finance with controls

Before — what was actually broken

  • Manual GL coding by AP team (80% of their time)
  • Three-way matching done manually with PO lookup
  • Exception triage entirely human
  • No discount capture beyond opportunistic
  • Vendor master cleanup sporadic
  • Audit fieldwork for AP took 3 days

What Ledger Summit implemented

  • AI coding agent: pre-trained on 18 months of historical invoice/coding pairs; confidence-scored suggestion per line item
  • Confidence threshold: ≥85% auto-code with reviewer sample-test; <85% routed to human reviewer
  • Three-way match agent: PO + receipt + invoice match with tolerance configuration; exception classification (price variance, qty variance, missing receipt, etc.)
  • Vendor matching: fuzzy match against vendor master with reviewer override on duplicates
  • Evidence pack per invoice: invoice image, AI suggestion + confidence, reviewer action, GL posting
  • Confidence threshold review: a reviewer checks 25 random sub-threshold invoices monthly for accuracy, informing threshold tuning
  • Change management: any rule / threshold / model change requires documented reason + reviewer approval + audit committee notification if material
  • SOX control framework: per-invoice trail; preparer ≠ approver; segregation of duties; quarterly testing
  • PCAOB-aligned governance: methodology memo, walkthrough script, design effectiveness testing
  • Auditor concurrence on design before deployment
  • Quarterly precision/recall metrics tracked: false-positive (AI codes when human wouldn't) vs. false-negative (AI doesn't code when it could)
  • Human-in-the-loop for high-risk categories (entertainment, advertising, capitalization decisions)
  • Continuous training: weekly model update from accepted/rejected suggestions
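
The ≥85% / <85% routing above can be sketched as follows. The 85% cutoff comes from the case study; the `suggestion` field names are illustrative assumptions, not the actual system's schema.

```python
# Minimal sketch of confidence-threshold routing for one invoice line item.
# Threshold from the case study; field names are illustrative assumptions.

AUTO_CODE_THRESHOLD = 0.85  # >= 85% -> auto-code (subject to monthly sample testing)

def route_line_item(suggestion: dict) -> str:
    """Decide how a single AI coding suggestion is handled.

    `suggestion` is assumed to carry the model's proposed GL account and a
    calibrated confidence score in [0, 1].
    """
    if suggestion["confidence"] >= AUTO_CODE_THRESHOLD:
        return "auto_code"    # posted automatically; sampled later for accuracy
    return "human_review"     # routed to an AP reviewer queue, never auto-posted
```

A usage example: `route_line_item({"gl_account": "6100", "confidence": 0.91})` returns `"auto_code"`, while a 0.72-confidence suggestion routes to `"human_review"`.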

AI agent governance mechanics

| Layer | Control |
| --- | --- |
| Scope definition | What the agent is allowed to code; account list whitelist; dollar threshold |
| Confidence threshold | ≥85% auto-code; <85% human review; threshold reviewed quarterly |
| Sample testing | Monthly review of 25 random AI-coded invoices for accuracy |
| Reviewer routing | Threshold-based + amount-based + GL-account-based routing rules |
| Exception queue | Three-way match exceptions classified and routed |
| Evidence pack | Invoice + AI suggestion + confidence + reviewer action + GL posting per transaction |
| Change management | Any rule / threshold change documented; reviewer approved; audit committee notified if material |
| SOX testing | Quarterly walkthrough; design + operating effectiveness testing |
| PCAOB alignment | Methodology memo; auditor concurrence on design; SOX 404 attestation walkthrough |
| Continuous improvement | Weekly model retrain from accepted/rejected suggestions; quarterly metrics review |
| Human-in-the-loop | High-risk categories always reviewed; complex modifications routed |
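
The three-way match layer above can be sketched as a tolerance check with classified exceptions. The tolerance values and dict fields here are illustrative assumptions, not the client's actual configuration.

```python
from typing import Optional

# Sketch of three-way matching: PO + receipt + invoice compared under
# configured tolerances, with failures classified for the exception queue.
# Tolerance values and record shapes are illustrative assumptions.

PRICE_TOLERANCE = 0.02  # allow 2% unit-price variance vs. the PO
QTY_TOLERANCE = 0       # allow no quantity variance vs. the receipt

def three_way_match(po: dict, receipt: Optional[dict], invoice: dict) -> str:
    """Return 'auto_pass' or a classified exception label for routing."""
    if receipt is None:
        return "exception:missing_receipt"
    if abs(invoice["qty"] - receipt["qty"]) > QTY_TOLERANCE:
        return "exception:qty_variance"
    price_delta = abs(invoice["unit_price"] - po["unit_price"]) / po["unit_price"]
    if price_delta > PRICE_TOLERANCE:
        return "exception:price_variance"
    return "auto_pass"
```

Each exception label maps to a routing rule in the exception queue, so a missing receipt and a price variance land with different resolvers.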

Implementation timeline

  • Weeks 1–2: Discovery: invoice volume analysis, GL coding patterns, vendor master audit
  • Weeks 3–4: AI agent design: model selection, training data preparation, confidence threshold determination
  • Weeks 5–6: Pilot deployment: 200 invoices / week shadow-mode alongside human coding
  • Weeks 7–8: Calibration: precision/recall measurement; threshold tuning; vendor master cleanup
  • Weeks 9–10: Full deployment with reviewer routing; SOX control framework activation
  • Weeks 11–12: Hypercare; monthly sample testing process; auditor walkthrough
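
The precision/recall measurement from the calibration weeks can be sketched as below: during shadow mode, each AI suggestion is compared against the human coder's decision. The pair format is an assumption for illustration.

```python
# Sketch of shadow-mode calibration metrics. Each pair is
# (human_gl_account, ai_gl_account or None if the AI abstained,
# i.e. its confidence fell below the threshold).

def coding_metrics(pairs):
    """Compute precision and recall over shadow-mode invoice lines."""
    coded = [(h, a) for h, a in pairs if a is not None]
    correct = sum(1 for h, a in coded if h == a)
    # Precision: of what the AI did code, how much matched the human coder.
    precision = correct / len(coded) if coded else 0.0
    # Recall: of all lines, how many the AI coded correctly (abstentions count against it).
    recall = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall}
```

Low precision argues for raising the threshold; low recall with high precision argues for lowering it, which is the trade-off tuned in weeks 7–8.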

Measured results

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Invoices auto-coded by AI | 0% | ~92% | +92 pp |
| AP team transactional time | 80% | 25% | −55 pp |
| Avg invoice cycle | 8 days | 1.4 days | −83% |
| Three-way match auto-pass | 62% | 88% | +26 pp |
| Discount capture | ~$8K / yr | ~$320K / yr | +$312K |
| Audit fieldwork days (AP) | 3 | 1.5 | −1.5 days |
| SOX walkthrough | N/A (manual process) | Clean | n/a |
| Quarterly model accuracy | n/a | Tracked; >95% precision | n/a |

Alternatives considered

| Option | Time | Cost | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Stampli (AI-native AP) | 3 months | $240K–$420K + license | Modern AI | License + replacement of Bill |
| Tipalti AI | 3 months | $240K–$420K + license | Strong globally | Replaces Bill |
| Bill native AI features | n/a | $0 incremental | Already in stack | Coverage thinner |
| Custom build (Anthropic Claude / OpenAI) | 5 months | $320K–$520K | Full control | Maintenance + governance |
| Ledger Summit + Bill + custom AI agent (selected) | 12 weeks | $180K–$280K | Right-sized; SOX-clean; preserves Bill | Maintenance ongoing |

When this approach fits

  • $50–500M companies with material invoice volume (1,000+ / month)
  • Existing AP automation (Bill, Tipalti, etc.) on top of which to add AI
  • SOX or audit-readiness pressure
  • PE-backed or pre-IPO with control framework expectations
  • Willingness to invest in governance + control framework
  • Auditor open to AI-in-finance walkthroughs

Lessons learned

  • Confidence threshold is the lever. 85% works for most coding; high-risk categories warrant 95% or human-only.
  • Sample testing is non-negotiable. 25 invoices / month is the audit-friendly cadence.
  • Auditor concurrence on design before deployment. Methodology buy-in saves remediation later.
  • Change management for thresholds and models. Documented reason for any change; audit committee notification if material.
  • Human-in-the-loop on high-risk. Entertainment, advertising, capitalization, related-party — always reviewed.
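
The per-category threshold lever described in the lessons above can be expressed as a small config: a default 85% cutoff with high-risk categories forced to human review. The category keys and config shape are illustrative assumptions.

```python
# Config sketch for per-category confidence thresholds.
# Default from the case study; overrides reflect the "human-only" rule
# for high-risk categories. Shape and keys are illustrative assumptions.

DEFAULT_THRESHOLD = 0.85
CATEGORY_OVERRIDES = {
    "entertainment": None,   # None -> always human-reviewed, never auto-coded
    "advertising": None,
    "capitalization": None,
    "related_party": None,
}

def route(category: str, confidence: float) -> str:
    """Apply the category override (if any), then the confidence cutoff."""
    threshold = CATEGORY_OVERRIDES.get(category, DEFAULT_THRESHOLD)
    if threshold is None or confidence < threshold:
        return "human_review"
    return "auto_code"
```

Keeping the override table as data rather than code also makes threshold changes easy to diff and approve under the change-management control.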

Frequently asked questions

How does the AI know how to code GL accounts?

Pre-trained on 18 months of historical invoice/coding pairs from your tenant; learns vendor patterns, line-item descriptions, and historical reviewer corrections.

What if the AI is wrong?

Confidence threshold catches uncertain cases; human reviewer signs off. Below threshold: never auto-coded.

How does this satisfy SOX?

Documented design (scope, threshold, routing); operating effectiveness testing (sample + walkthrough); evidence per invoice; auditor concurrence on methodology.

What about hallucination risk?

Output is validated against the known vendor master, GL accounts, and PO data; the confidence threshold catches uncertain cases. A hallucinated account or vendor fails validation and routes to human review rather than posting.
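
That validation step can be sketched as a reference-data check that runs before any posting, independent of the confidence score. The reference sets and field names below are illustrative assumptions.

```python
# Sketch of output validation against known reference data: a suggestion
# that names an account or vendor outside these sets can never auto-post,
# regardless of its confidence. Sets and field names are illustrative.

GL_WHITELIST = {"6100", "6200", "7300"}   # accounts the agent may code to
VENDOR_MASTER = {"V-1001", "V-1002"}      # active vendor IDs

def validate_suggestion(suggestion: dict) -> list:
    """Return a list of validation errors; non-empty -> route to human review."""
    errors = []
    if suggestion["gl_account"] not in GL_WHITELIST:
        errors.append("unknown_gl_account")
    if suggestion["vendor_id"] not in VENDOR_MASTER:
        errors.append("unknown_vendor")
    return errors
```

Because this check is deterministic, a hallucinated value is caught even when the model reports high confidence.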

Does this work in QuickBooks / Xero / smaller GLs?

Conceptually yes; integration depth varies by platform. The best fit today is NetSuite / Sage Intacct.

What about international tax / VAT?

Out of scope for this implementation; multi-currency handled per Bill / NetSuite native; VAT requires separate workstream.

How do you train new vendors?

Few-shot learning from initial reviewer corrections; full integration after ~10 invoices.

Can the AI suggest journal accruals at month-end?

Yes — accrual suggestion is a related workflow; same control framework applies.

What about IPO / public-company SOX 404(b)?

Yes — the framework supports auditor attestation. Methodology memo + design + operating testing required.

How does this compare to Stampli?

Stampli is AP automation with embedded AI; we layer AI agent on top of Bill (existing stack). Trade-off: maturity vs. integration depth.

Want AI in AP without giving up SOX-clean controls?

A 30-minute call walks through your AP volume and tells you what an AI agent + governance setup would look like.

Book a free call