FinAgent-Bench
FinAgent-Bench is t54's public, reproducible, auto-scored benchmark for personal-finance AI agents. It is the evaluation harness we use to measure how well a financial agent actually performs before it is trusted with real-world decisions. The project is available on GitHub at t54-labs/FinAgent-Bench.
The benchmark answers two questions about any financial agent:
- Is its knowledge solid? Accuracy across the US CFP Board's official knowledge domains, reported as a comparable "CFP-equivalent score."
- Does it do the job reliably? Performance on high-frequency practical tasks such as credit repair, credit-card analysis, debt optimization, and repayment-plan generation.
Two themes run through every item. First, compute correctly: money math must be done with tools and verified against a deterministic reference implementation, never left to the model's mental arithmetic. Second, stay within guardrails: the agent must refuse non-compliant requests, disclose its assumptions, and avoid advice that is mechanically optimal but unsafe.
What It Tests
FinAgent-Bench is organized into a knowledge layer and an applied layer.
The knowledge layer is aligned to the CFP Board's eight Principal Knowledge Domains. Investment Planning (A4), the highest-weight domain, is the one currently implemented; the remaining domains are designed and queued.
The applied layer covers four practical capabilities that financial agents are most often asked to perform.
| Code | Capability | What it probes |
|---|---|---|
| B1 | Credit Repair | FCRA disputability judgment and compliance guardrails |
| B2 | Credit Card Analysis | Utilization, the minimum-payment (negative-amortization) trap, and optimal payoff |
| B3 | Debt Optimization | Avalanche vs. snowball strategy and the behavioral trade-off |
| B4 | Repayment Plan Generation | Month-by-month amortization tables under cash-flow constraints |
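To make the B3 and B4 rows concrete, here is a minimal sketch of the kind of deterministic money math these tasks call for. It is not the benchmark's reference implementation: the function names `payoff_order` and `repayment_schedule`, the dictionary keys, and the round-half-up-to-the-cent convention are all assumptions made for illustration. It uses `Decimal` rather than floats so that every cent is reproducible.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed rounding unit: one cent


def payoff_order(debts, strategy):
    """Order debt names by strategy (hypothetical helper).

    avalanche: highest APR first; snowball: smallest balance first.
    """
    if strategy == "avalanche":
        ranked = sorted(debts, key=lambda d: -Decimal(d["apr"]))
    elif strategy == "snowball":
        ranked = sorted(debts, key=lambda d: Decimal(d["balance"]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [d["name"] for d in ranked]


def repayment_schedule(balance, apr, payment, max_months=600):
    """Month-by-month rows for a fixed monthly payment (hypothetical helper).

    Amounts are passed as strings so Decimal parses them exactly.
    Raises ValueError when the payment does not exceed the month's
    interest, because the balance would then never fall.
    """
    balance = Decimal(balance)
    monthly_rate = Decimal(apr) / Decimal(12)
    payment = Decimal(payment)
    rows = []
    for month in range(1, max_months + 1):
        interest = (balance * monthly_rate).quantize(CENT, rounding=ROUND_HALF_UP)
        if payment <= interest:
            raise ValueError("payment does not cover interest; balance never falls")
        paid = min(payment, balance + interest)  # final payment may be smaller
        principal = paid - interest
        balance -= principal
        rows.append({
            "month": month,
            "payment": paid,
            "interest": interest,
            "principal": principal,
            "balance": balance,
        })
        if balance == 0:
            break
    return rows
```

For example, `repayment_schedule("1000", "0.12", "500")` produces three rows with total interest of 15.25, while `repayment_schedule("1000", "0.24", "20")` raises `ValueError` because a 20.00 payment only matches the first month's interest. That second case is the situation a minimum-payment item is meant to catch.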
Design Principles
FinAgent-Bench is built to be objective and hard to game.
- Traceable framework. Knowledge items map to the official CFP Board domains and weights.
- Objective first. Scoring prefers machine-checkable formats, and numeric answers are verified against a Python reference implementation.
- Anti-contamination. Item banks store parameters rather than answers; the gold answer is recomputed at scoring time, and parameters can be randomized to generate variants.
- Process and outcome. Scoring inspects not just the final answer but whether the right tool was called and whether guardrails held.
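The principles above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the project's actual code: the item schema (`balance`, `limit`), the function names, the `"calculator"` tool name, and the one-cent tolerance are all hypothetical. It shows an item stored as parameters only, a gold answer recomputed at scoring time by a deterministic reference function, seeded variant generation, and a score that checks both the outcome and whether a tool was called.

```python
import random
from decimal import Decimal, ROUND_HALF_UP


def reference_utilization(params):
    """Deterministic reference: credit utilization as a percentage, 2 d.p."""
    pct = Decimal(params["balance"]) / Decimal(params["limit"]) * 100
    return pct.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def make_variant(seed):
    """Generate item parameters from a seed; no answer is stored."""
    rng = random.Random(seed)
    limit = rng.randrange(1000, 20001, 500)
    balance = rng.randrange(0, limit + 1, 50)
    return {"balance": balance, "limit": limit}


def score(params, answer, tool_calls, tolerance=Decimal("0.01")):
    """Recompute the gold answer now, then check outcome and process."""
    gold = reference_utilization(params)
    outcome_ok = abs(Decimal(str(answer)) - gold) <= tolerance
    process_ok = "calculator" in tool_calls  # assumed tool name
    return {
        "outcome": outcome_ok,
        "process": process_ok,
        "passed": outcome_ok and process_ok,
    }
```

With `params = {"balance": 1500, "limit": 5000}`, the reference value is 30.00. An answer of `"30.00"` with `["calculator"]` in the tool log passes, while the same answer with an empty tool log fails on process even though the number is right.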
The framework and scoring are agent-agnostic, with adapters for OpenAI Codex and Claude Code, so the same items and scorers can evaluate any agent on equal footing.
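One way to picture an agent-agnostic harness is a small adapter interface that every agent is wrapped in. The sketch below is purely hypothetical: `AgentAdapter`, `AgentResult`, `CannedAgent`, and `evaluate` are invented names, and nothing here reflects the real Codex or Claude Code adapters. It only shows how the same items and scoring loop can run against any object that exposes a `run` method.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class AgentResult:
    """What the harness needs back from any agent (hypothetical shape)."""
    answer: str
    tool_calls: list[str] = field(default_factory=list)


class AgentAdapter(Protocol):
    """Hypothetical interface each agent-specific adapter would satisfy."""
    def run(self, prompt: str) -> AgentResult: ...


class CannedAgent:
    """Stand-in agent for illustration; always returns a fixed answer."""

    def __init__(self, answer, tool_calls):
        self._answer = answer
        self._tool_calls = tool_calls

    def run(self, prompt):
        return AgentResult(self._answer, list(self._tool_calls))


def evaluate(adapter: AgentAdapter, items):
    """items: list of (prompt, expected_answer); returns fraction correct."""
    if not items:
        return 0.0
    correct = sum(
        adapter.run(prompt).answer == expected for prompt, expected in items
    )
    return correct / len(items)
```

Because `evaluate` depends only on the `run` method, swapping in a different agent means writing a new adapter class and leaving items and scorers untouched.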
Relationship To Trustline
FinAgent-Bench gives t54 an objective, reproducible measure of agent competence and safety. Trustline's underwriting decisions depend on knowing whether an agent computes correctly and stays within guardrails, and FinAgent-Bench turns that question into a score that can be tracked, compared, and regression-tested as agents evolve.