

AI Agents Need Better Data, Not Bigger Models – Daloopa Benchmark


AI-powered fundamental and historical data provider Daloopa has published new benchmark research examining how well leading AI agent systems perform on real-world financial research tasks. Titled “Benchmarking AI Agents on Financial Retrieval”, the study evaluates whether recent advances in agentic AI translate into reliable outcomes when accuracy matters most.

The benchmark focuses on a core question facing financial institutions experimenting with AI agents: are improvements in model reasoning enough, or does accuracy depend more fundamentally on the data and infrastructure those agents rely on?

Daloopa’s findings point clearly to the latter.

Testing Agentic AI on Financial Retrieval

The benchmark evaluates three frontier agent frameworks: OpenAI’s Agents SDK with GPT-5.2, Anthropic’s Agent SDK with Claude Opus 4.5, and Google’s Agent Development Kit (ADK) with Gemini 3 Pro. Each was tested on 500 financial questions designed to reflect real research tasks, such as retrieving specific reported financial figures.

The study compares agent performance when relying on public, web-sourced information versus when retrieving data from a structured financial database. This distinction is central to the research. Rather than benchmarking language models in isolation, Daloopa tests full agent workflows that combine search, reasoning, and retrieval.
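
To make the distinction concrete, the two configurations can be thought of as alternative tool sets handed to the same agent. The function names and signatures below are illustrative stand-ins, not Daloopa’s actual tooling, and the agent-framework wiring is omitted:

```python
# Sketch of the two retrieval configurations the benchmark contrasts.
# Both tools are hypothetical stand-ins, not Daloopa's actual tooling.

def web_search(query: str) -> str:
    """Web-only path: free-text search over public sources.

    Returns unstructured text the agent must parse and interpret itself.
    """
    raise NotImplementedError("stand-in for a web search tool")

def structured_lookup(ticker: str, metric: str, period: str) -> float:
    """Structured path: keyed lookup against a curated financial database.

    Returns an exact reported figure, with provenance, no parsing required.
    """
    raise NotImplementedError("stand-in for a structured data API")

# The benchmark runs the same 500 questions through agents equipped with
# only the first tool, then only the second, and compares accuracy.
```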

The results highlight a sharp divergence between the two approaches.

Structured Data Drives Accuracy Gains

Across all three agent systems, accuracy improved materially when agents retrieved data from a structured database rather than the public web. Daloopa reports that accuracy rose to roughly 90 per cent, representing improvements of up to 71 percentage points compared with web-only retrieval.

The finding helps explain why AI agents continue to struggle in high-stakes financial domains despite visible advances in model capability. Access to verifiable, well-structured financial data remains a limiting factor.

As Daloopa CEO Thomas Li puts it, “Accuracy in AI-driven finance isn’t just a model problem; it’s a data access problem.”

The benchmark does not claim that agents have reached human-level performance or that they can be deployed autonomously without oversight. Instead, it shows that data provenance and structure materially influence whether agents retrieve the correct answer at all.

90 Per Cent Won’t Do

In many consumer or general knowledge applications, 90 per cent accuracy might be sufficient. In finance, the residual error rate remains material. Incorrect financial figures can affect valuations, research conclusions, or downstream risk assessments.

Crucially, the benchmark finds that the remaining errors are not random. They cluster around identifiable, repeatable issues related to financial structure rather than language comprehension.

Two factors stand out: fiscal calendar alignment and naming conventions. Agents performed better on US companies than on non-US companies, largely because US issuers more often use December fiscal year-ends that align with the standard calendar. Non-US companies, which more frequently report on non-December fiscal cycles, introduced additional complexity that agents struggled to resolve consistently.
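
A toy example illustrates why fiscal alignment trips agents up. The mapping below assumes the common convention of labelling a fiscal year by the calendar year in which it ends; issuers that label differently are precisely the ambiguity agents must resolve:

```python
from datetime import date

# Toy illustration of the fiscal-alignment problem; not benchmark code.

def fiscal_quarter_to_calendar(fy: int, quarter: int,
                               fye_month: int) -> tuple[date, date]:
    """Map a fiscal quarter to the first days of its start and end months."""
    # Fiscal year N starts the month after the fiscal year-end month,
    # in calendar year N-1 (or January of year N for December filers).
    start_month = fye_month % 12 + 1
    start_year = fy if fye_month == 12 else fy - 1
    m = start_month + 3 * (quarter - 1)            # advance to the quarter
    y, m = start_year + (m - 1) // 12, (m - 1) % 12 + 1
    em = m + 2                                     # last month of the quarter
    ey, em = y + (em - 1) // 12, (em - 1) % 12 + 1
    return date(y, m, 1), date(ey, em, 1)

# A December filer's Q1 FY2024 is Jan-Mar 2024...
print(fiscal_quarter_to_calendar(2024, 1, 12))  # (2024-01-01, 2024-03-01)
# ...but a June filer's Q1 FY2024 is Jul-Sep 2023.
print(fiscal_quarter_to_calendar(2024, 1, 6))   # (2023-07-01, 2023-09-01)
```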

The implication is not that agents fail because finance is inherently opaque, but that financial meaning is encoded in conventions that are often implicit in unstructured sources.

Calculation Is a New Error Surface

Daloopa’s benchmark focuses on retrieval rather than calculation, but the company is already looking beyond that boundary. According to Li, extending agent capabilities introduces a new class of challenges.

“Our current roadmap involves extending capabilities from simple retrieval to more complex calculations. Convention-selection errors naturally arise within the realm of complex calculations, so this intuition is well-aligned with where things are headed.”

As agents move from retrieving reported numbers to calculating derived values, such as accruals or valuations, they must apply market conventions that vary by instrument, geography, and context. Errors in convention selection can produce outputs that appear numerically plausible while being economically incorrect.
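
A worked example shows how convention selection alone changes a result. The accrual below is computed under two common day-count conventions (a simplified 30/360 variant is used for brevity); the figures are illustrative, not from the benchmark:

```python
from datetime import date

# Convention-selection risk in miniature: the same coupon accrual under
# two common day-count conventions. Figures are illustrative only.

def days_30_360(start: date, end: date) -> int:
    """Simplified US 30/360 day count: months are treated as 30 days."""
    d1, d2 = min(start.day, 30), min(end.day, 30)
    return (360 * (end.year - start.year)
            + 30 * (end.month - start.month) + (d2 - d1))

def accrued_interest(principal: float, rate: float, start: date, end: date,
                     convention: str) -> float:
    """Accrued interest on `principal` at annual `rate` between two dates."""
    if convention == "30/360":
        return principal * rate * days_30_360(start, end) / 360
    if convention == "ACT/365":
        return principal * rate * (end - start).days / 365
    raise ValueError(f"unknown convention: {convention}")

start, end = date(2025, 1, 31), date(2025, 7, 31)
# Both results look plausible; only one matches the instrument's terms.
print(accrued_interest(1_000_000, 0.05, start, end, "30/360"))   # 25000.0
print(accrued_interest(1_000_000, 0.05, start, end, "ACT/365"))  # ~24794.52
```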

This introduces a different category of risk from simple retrieval mistakes.

Positioning Convention Logic

Li outlines three approaches to encoding convention logic in agentic systems, each with trade-offs.

Implicit learning: The default approach relies on convention knowledge learned within the model itself. This avoids additional engineering overhead but introduces variability and non-determinism. As Li notes, different models may encode convention knowledge differently depending on training and post-training processes.

In-context learning: A second approach is to supply convention logic through in-context learning, sometimes described as “skills”. This can be implemented quickly through prompts or external documentation, but it does not scale well when convention permutations multiply across jurisdictions and instruments. Token limits and efficiency become constraints.

Model Context Protocol: The third approach is to externalise convention logic through Model Context Protocol (MCP)-based components, as sketched below. This enables deterministic calculators, branching logic, and richer workflows, but comes with higher engineering and maintenance costs.
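
As a rough illustration of the third approach, a convention-aware calculator could be exposed as a deterministic tool using the MCP Python SDK’s FastMCP helper. The tool below is a hypothetical, deliberately minimal sketch, not a Daloopa component:

```python
# Hypothetical MCP server exposing a deterministic convention calculator,
# built with the MCP Python SDK's FastMCP helper. Illustrative only.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("convention-calculators")

@mcp.tool()
def year_fraction(days: int, convention: str) -> float:
    """Convert a day count to a year fraction under a named convention.

    The agent chooses the convention; the arithmetic is fixed code,
    not model output, so results are deterministic and auditable.
    """
    denominators = {"ACT/360": 360, "ACT/365": 365}
    if convention not in denominators:
        # Refuse rather than guess: convention selection stays explicit.
        raise ValueError(f"unsupported convention: {convention}")
    return days / denominators[convention]

if __name__ == "__main__":
    mcp.run()  # serve over stdio for an MCP-capable agent to call
```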

Rather than promoting a single solution, Li frames the choice as a function of complexity. If convention logic scales non-linearly across companies, geographies, or products, implicit or in-context approaches may prove insufficient.

Measuring Beyond Retrieval Accuracy

The benchmark raises questions about how agent performance should be measured as use cases expand. Retrieval accuracy captures only part of the problem once agents are expected to interpret or apply financial rules.

Asked whether convention correctness could become a first-class benchmark metric, Li responds cautiously but openly: “Potentially, yes. At Daloopa, data accuracy of all types is a core priority, so elevating convention correctness to a primary benchmark metric would be consistent with that focus.”

Such a shift would reflect a broader understanding of accuracy in finance, encompassing not just whether an answer is retrieved, but whether it is derived using the correct assumptions.

Another issue highlighted by the research is how agents behave under uncertainty. Language models tend to provide an answer even when inputs are ambiguous or incomplete.

Li frames this primarily as a product and design question: “By default, language agents tend to prefer verbosity and will attempt to answer even with uncertain inputs.” He notes that this behaviour can be constrained through system-level instructions or in-context guidance, allowing agents to avoid overconfident guesses when key conventions are unclear.
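
In practice, that guidance might be as simple as a system-level instruction supplied to the agent. The wording below is illustrative only, not Daloopa’s actual prompt:

```python
# Illustrative system-level guidance of the kind Li describes; the
# wording is an assumption, not Daloopa's actual prompt.
SYSTEM_GUIDANCE = """\
You answer questions about reported financial data. Before answering:
- If the fiscal period, currency, unit, or reporting convention is
  ambiguous, ask a clarifying question instead of guessing.
- Never estimate a reported figure; retrieve it or say you cannot.
- Cite the source (filing, period, line item) for every number returned.
"""
```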

A Grounded Reference Point

Daloopa’s benchmark provides empirical evidence that improvements in financial AI accuracy depend less on incremental model gains and more on data structure, metadata, and convention handling. It shows that agentic systems can approach high accuracy when grounded in structured, auditable data, while also making clear why further gains require infrastructure rather than inference.

As financial firms evaluate how and where to deploy AI agents, the research offers a grounded reference point: progress hinges on making financial rules explicit and machine-enforceable, not merely on asking models to reason harder.

