FinAgent-Bench

FinAgent-Bench is t54's public, reproducible, auto-scored benchmark for personal-finance AI agents. It is the evaluation harness we use to measure how well a financial agent actually performs before it is trusted with real-world decisions. The project is available on GitHub at t54-labs/FinAgent-Bench.

The benchmark answers two questions about any financial agent:

  1. Is its knowledge solid? Accuracy across the US CFP Board's official knowledge domains, reported as a comparable "CFP-equivalent score."
  2. Does it do the job reliably? Performance on high-frequency practical tasks such as credit repair, credit-card analysis, debt optimization, and repayment-plan generation.

Two themes run through every item. First, compute correctly: money math must be done with tools and verified against a deterministic reference implementation, never left to the model's mental arithmetic. Second, stay within guardrails: the agent must refuse non-compliant requests, disclose its assumptions, and avoid advice that is mechanically optimal but unsafe.
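To make "compute correctly" concrete, here is a minimal sketch of a deterministic reference calculation and a tolerance check. The function names `monthly_payment` and `matches_reference` and the one-cent tolerance are assumptions made for illustration; they are not taken from the FinAgent-Bench codebase.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP

CENT = Decimal("0.01")


def monthly_payment(principal: Decimal, annual_rate: Decimal, months: int) -> Decimal:
    """Level payment on a fully amortizing loan, rounded to the cent."""
    r = annual_rate / Decimal(12)  # periodic (monthly) rate
    if r == 0:
        payment = principal / Decimal(months)
    else:
        growth = (Decimal(1) + r) ** months
        payment = principal * r * growth / (growth - Decimal(1))
    return payment.quantize(CENT, rounding=ROUND_HALF_UP)


def matches_reference(agent_answer: str, reference: Decimal,
                      tolerance: Decimal = CENT) -> bool:
    """True when the agent's numeric answer is within tolerance of the reference."""
    cleaned = agent_answer.replace("$", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return False  # non-numeric answers never match
    return abs(value - reference) <= tolerance


# Hypothetical usage: check an agent's answer for a $10,000, 12% APR, 12-month loan.
reference = monthly_payment(Decimal("10000"), Decimal("0.12"), 12)
print(reference, matches_reference("$888.49", reference))
```

Using `Decimal` rather than binary floats keeps the reference value exact to the cent, so the comparison does not depend on floating-point rounding.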

What It Tests

FinAgent-Bench is organized into a knowledge layer and an applied layer.

The knowledge layer is aligned to the CFP Board's eight Principal Knowledge Domains. Investment Planning (A4), the highest-weight domain, is currently implemented, and the remaining domains are designed and queued.

The applied layer covers four practical capabilities that financial agents are most often asked to perform.

Code | Capability | What it probes
B1 | Credit Repair | FCRA disputability judgment and compliance guardrails
B2 | Credit Card Analysis | Utilization, the minimum-payment (negative-amortization) trap, and optimal payoff
B3 | Debt Optimization | Avalanche vs. snowball strategy and the behavioral trade-off
B4 | Repayment Plan Generation | Month-by-month amortization tables under cash-flow constraints
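As an illustration of the arithmetic behind B2 through B4, the sketch below simulates a month-by-month payoff under a fixed budget and compares avalanche (highest APR first) with snowball (smallest balance first). The `Debt` class, the `simulate` function, the monthly compounding convention, and the sample figures are all assumptions for this example, not the benchmark's actual reference implementation.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass(frozen=True)
class Debt:
    name: str
    balance: Decimal
    apr: Decimal          # e.g. Decimal("0.24") for 24%
    min_payment: Decimal


def simulate(debts: list[Debt], budget: Decimal, strategy: str,
             max_months: int = 600) -> tuple[int, Decimal]:
    """Return (months_to_payoff, total_interest) for a fixed monthly budget."""
    if strategy == "avalanche":
        order = sorted(debts, key=lambda d: d.apr, reverse=True)
    elif strategy == "snowball":
        order = sorted(debts, key=lambda d: d.balance)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if budget < sum(d.min_payment for d in debts):
        raise ValueError("budget does not cover the minimum payments")

    balances = {d.name: d.balance for d in debts}
    total_interest = Decimal(0)
    months = 0
    while any(b > 0 for b in balances.values()):
        if months >= max_months:
            # Payments never outpace interest: the negative-amortization trap.
            raise RuntimeError("debts are not amortizing under this budget")
        months += 1
        # 1. Accrue one month of interest on each open balance.
        for d in debts:
            if balances[d.name] > 0:
                interest = (balances[d.name] * d.apr / 12).quantize(
                    CENT, rounding=ROUND_HALF_UP)
                balances[d.name] += interest
                total_interest += interest
        # 2. Pay every minimum (or the remaining balance, if smaller).
        remaining = budget
        for d in debts:
            pay = min(d.min_payment, balances[d.name])
            balances[d.name] -= pay
            remaining -= pay
        # 3. Send whatever is left to the debts in strategy order.
        for d in order:
            pay = min(remaining, balances[d.name])
            balances[d.name] -= pay
            remaining -= pay
    return months, total_interest


debts = [
    Debt("card", Decimal("3000"), Decimal("0.24"), Decimal("60")),
    Debt("loan", Decimal("1000"), Decimal("0.06"), Decimal("25")),
]
for strategy in ("avalanche", "snowball"):
    print(strategy, simulate(debts, Decimal("500"), strategy))
```

In this sample, avalanche pays less total interest because the extra cash goes to the 24% card first. Snowball clears the small loan sooner, which is the behavioral trade-off B3 asks the agent to weigh.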

Design Principles

FinAgent-Bench is built to be objective and hard to game.

  • Traceable framework. Knowledge items map to the official CFP Board domains and weights.
  • Objective first. Scoring prefers machine-checkable formats, and numeric answers are verified against a Python reference implementation.
  • Anti-contamination. Item banks store parameters rather than answers; the gold answer is recomputed at scoring time, and parameters can be randomized to generate variants.
  • Process and outcome. Scoring inspects not just the final answer but whether the right tool was called and whether guardrails held.

The framework and scoring are agent-agnostic, with adapters for OpenAI Codex and Claude Code, so the same items and scorers can evaluate any agent on equal footing.
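One way to picture an agent-agnostic harness is a small adapter interface that every agent backend implements. The sketch below is purely illustrative: `AgentAdapter`, `AgentResult`, `StubAdapter`, and `evaluate` are hypothetical names, and the real adapters in the repository may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class AgentResult:
    answer: str
    tool_calls: list[str] = field(default_factory=list)


class AgentAdapter(Protocol):
    """Anything with a name and a run() method can be evaluated."""
    name: str

    def run(self, prompt: str) -> AgentResult: ...


class StubAdapter:
    """Stand-in agent that returns a canned answer; useful for testing scorers."""
    name = "stub"

    def __init__(self, canned: str) -> None:
        self.canned = canned

    def run(self, prompt: str) -> AgentResult:
        return AgentResult(answer=self.canned, tool_calls=["calculator"])


def evaluate(adapter: AgentAdapter, prompts: list[str],
             scorer: Callable[[str, AgentResult], bool]) -> float:
    """Run every prompt through the adapter and return the pass rate."""
    results = [scorer(p, adapter.run(p)) for p in prompts]
    return sum(results) / len(results) if results else 0.0


rate = evaluate(StubAdapter("42"), ["q1", "q2"], lambda p, r: r.answer == "42")
print(rate)
```

Keeping the scorer independent of the adapter is what lets the same items and scoring logic be reused across agents.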

Relationship To Trustline

FinAgent-Bench gives t54 an objective, reproducible measure of agent competence and safety. Trustline's underwriting decisions depend on knowing whether an agent computes correctly and stays within guardrails, and FinAgent-Bench turns that question into a score that can be tracked, compared, and regression-tested as agents evolve.