
A-Team Insight Blogs

When AI Meets SEC Filings: Where LLMs Deliver, Where They Don’t, and Why the Plumbing Matters


Can LLMs reliably do what financial analysts do? Compare peers, track firms over time, and spot what’s changed? A panel at the recent A-Team Group/Eagle Alpha Alternative Data Conference New York explored the question, and new research from Goldman Sachs and Yale suggests the answer is more nuanced than the industry assumes.

The panel session “How to Use AI to Build Better Data Products” brought together Norman Niemer, Head of Research & Data Science at UBS, and Sid Ghatak, founder of Increase Alpha and former chief AI architect at the Federal Energy Regulatory Commission. The discussion, moderated by Andrew Delaney of A-Team Group, surfaced a set of operational tensions that anyone building or buying alternative data products should be paying attention to.

The headline question – where is AI delivering the most value in alternative data product development? – drew a range of responses. But the more revealing discussion concerned whether large language models belong at the core of financial data extraction pipelines, or whether they’re better deployed at the edges as productivity tools and presentation layers, while something more deterministic does the heavy lifting.

The Deterministic Pipeline vs. the LLM

One panellist described a data product built entirely on proprietary, non-LLM artificial intelligence: a deterministic extraction engine that pulls structured features from SEC filings to predict short-term returns. Because the system doesn’t use language models, it is technically incapable of hallucination, it runs the same way every time, and the output is completely auditable and compliant.
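
To make the deterministic idea concrete, the sketch below shows the kind of rule-based extraction step such an engine might rely on. The field names and the regular expression are illustrative assumptions, not a description of the panellist's actual system.

```python
# Illustrative sketch only: a deterministic, rule-based feature extractor.
# Field names and patterns are hypothetical, not the panellist's actual engine.
import re
from dataclasses import dataclass

@dataclass
class FilingFeatures:
    accession_no: str          # SEC accession number of the filing
    risk_factor_words: int     # length of the "Risk Factors" section, in words
    mentions_going_concern: bool

def extract_features(accession_no: str, filing_text: str) -> FilingFeatures:
    """Pull the same features from every filing, the same way every time."""
    # Grab the text between the "Item 1A" and "Item 1B" headings (10-K convention).
    match = re.search(r"Item\s+1A\.(.*?)Item\s+1B\.", filing_text,
                      re.IGNORECASE | re.DOTALL)
    risk_section = match.group(1) if match else ""
    return FilingFeatures(
        accession_no=accession_no,
        risk_factor_words=len(risk_section.split()),
        mentions_going_concern="going concern" in filing_text.lower(),
    )
```

Because the rules are fixed, the same filing always produces the same features, which is what makes the output auditable.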

The panellist referenced Fin-RATE, a benchmark published in February 2026 by researchers at Goldman Sachs and Yale, which tests LLMs against three task types that mirror real analyst workflows: extracting facts from a single filing; comparing disclosures across companies; and tracking how the same firm’s filings change over time.
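
As an illustration of the benchmark's structure, the sketch below models the three task types the way an evaluation harness might. The schema and identifiers are hypothetical, not Fin-RATE's published format.

```python
# Hypothetical representation of the three task types described above.
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    SINGLE_FILING_EXTRACTION = "extract facts from a single filing"
    CROSS_COMPANY_COMPARISON = "compare disclosures across companies"
    LONGITUDINAL_TRACKING = "track how one firm's filings change over time"

@dataclass
class EvalItem:
    task: TaskType
    question: str
    filings: list[str]   # identifiers of the filings the answer must draw on
    gold_answer: str

example = EvalItem(
    task=TaskType.CROSS_COMPANY_COMPARISON,
    question="Which issuer discloses the higher customer-concentration risk?",
    filings=["FILING-A-2024", "FILING-B-2024"],   # placeholder identifiers
    gold_answer="Issuer A",
)
```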

The headline finding is that LLMs are reasonably competent at single-document extraction, but degrade substantially on the tasks analysts actually spend their time on. When asked to compare disclosures across companies, models fabricated comparative claims and confused which data belonged to which entity. When tracking the same firm over time, they treated each year’s filing independently, producing temporal mismatches and invented trend claims. Finance-tuned models were particularly brittle: strong on single documents, they suffered near-total collapse on cross-entity work.

Under retrieval-augmented conditions – the way most production systems actually work – performance dropped further still, and the researchers demonstrated this was primarily a retrieval problem rather than a generation one. The models performed adequately when given the right evidence; the bottleneck was surfacing it. For cross-company queries, the necessary evidence was never retrieved at all for the vast majority of questions. A hierarchical retrieval approach – pre-bucketing documents by company and year before searching – dramatically improved results, suggesting that how you index and organise your corpus determines whether the AI layer works at all. This maps directly to a point made repeatedly during the panel: the “unsexy” work of data organisation, entity resolution and temporal indexing determines whether AI can deliver on its promise.
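
The sketch below illustrates that pre-bucketing idea under simple assumptions: documents arrive tagged with company and year, and search only runs inside the matching bucket. The lexical scorer is a stand-in for whatever embedding-based ranking a production system would use.

```python
# Illustrative sketch of hierarchical retrieval: restrict the corpus to a
# (company, year) bucket first, then rank within it. Not any vendor's API.
from collections import defaultdict

class BucketedIndex:
    def __init__(self):
        # (company, year) -> list of document chunks
        self.buckets = defaultdict(list)

    def add(self, company: str, year: int, chunk: str) -> None:
        self.buckets[(company, year)].append(chunk)

    def search(self, company: str, year: int, query: str, k: int = 5) -> list[str]:
        # Narrow the search space before any similarity ranking happens.
        candidates = self.buckets.get((company, year), [])
        return sorted(candidates, key=lambda c: _overlap(query, c), reverse=True)[:k]

def _overlap(query: str, chunk: str) -> int:
    """Toy lexical score (shared word count); the bucketing is the point, not the scorer."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```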

AI as Plumbing: The Unglamorous Work That Matters Most

The panel converged on a shared observation: AI’s most tangible impact in alternative data isn’t in signal generation. It’s in data engineering – the entity resolution, fuzzy matching, normalisation, and deduplication work that has historically been the most time-consuming part of onboarding alternative datasets.

One participant described using AI to match property reviews against a holdings database, a task previously too labour-intensive to justify. The gain wasn’t speed on an existing task; it was enabling analysis that wouldn’t have been attempted at all. Another described compressing days of what-if regression work into near-instant iterations.
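
The sketch below shows the flavour of fuzzy matching involved, using only the Python standard library; the 0.7 similarity threshold is an assumption that would need tuning against real holdings data.

```python
# Standard-library sketch of fuzzy name matching against a holdings list.
from difflib import SequenceMatcher

def best_match(name: str, holdings: list[str], threshold: float = 0.7):
    """Return the holding whose name is most similar to `name`, or None."""
    def score(candidate: str) -> float:
        return SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
    candidate = max(holdings, key=score, default=None)
    if candidate is not None and score(candidate) >= threshold:
        return candidate
    return None

# A free-text review name resolves to the closest entry in the holdings database.
print(best_match("Marriott Intl Inc",
                 ["Marriott International, Inc.", "Hilton Worldwide Holdings Inc."]))
```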

But trust remains the binding constraint. Even participants who use AI extensively for entity resolution noted they still trust human verification more, particularly for historical mappings. AI can match a brand to a company entity today, but corporate structures and brand portfolios change over time. Maintaining point-in-time accuracy across a ten-year history is precisely the temporal mismatch problem that Fin-RATE documents formally.
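
A point-in-time mapping table is one common way to handle this; the sketch below uses invented brands, entities and dates purely to illustrate the idea.

```python
# Hypothetical point-in-time mapping: each brand-to-entity link carries a
# validity window, so a 2015 observation resolves to the owner at that date.
from dataclasses import dataclass
from datetime import date

@dataclass
class Mapping:
    brand: str
    entity: str
    valid_from: date
    valid_to: date | None   # None means the mapping is still current

MAPPINGS = [
    Mapping("CityStay Hotels", "Legacy Hospitality plc", date(2010, 1, 1), date(2018, 7, 1)),
    Mapping("CityStay Hotels", "Global Lodging Group Inc.", date(2018, 7, 1), None),
]

def resolve(brand: str, as_of: date) -> str | None:
    """Return the entity that owned `brand` on `as_of`, if the history covers it."""
    for m in MAPPINGS:
        if m.brand == brand and m.valid_from <= as_of and (m.valid_to is None or as_of < m.valid_to):
            return m.entity
    return None

print(resolve("CityStay Hotels", date(2015, 6, 1)))   # owner before the change
print(resolve("CityStay Hotels", date(2021, 6, 1)))   # owner after the change
```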

There was also a candid acknowledgement that LLMs are fundamentally language models, and much of the data that needs cleaning is numerical, tabular, and structural. This aligns with Fin-RATE’s finding that finance-specific numerical errors – confusion over units and scales, mistakes in computation logic – constitute a persistent category of LLM failure.

One structurally interesting observation: dirty data is the primary obstacle to AI adoption in financial workflows, and AI itself may be the best tool for cleaning it. But this requires deliberate investment in what multiple participants called “ugly plumbing work,” i.e. effort that organisations often skip in favour of more visible applications. The panel’s audience poll confirmed the point: data quality and reliability topped the list of challenges, ahead of integration, explainability, and cost.

Competitive Implications

The panel explicitly addressed whether AI is a productivity tool or a capability enabler. The consensus was both, but the boundary matters strategically.

If AI dramatically reduces the cost of data engineering – onboarding, mapping, and normalising messy datasets – then smaller firms can now do work that previously required the infrastructure budgets of the largest shops. Data engineering used to be part of the alpha; if AI commoditises it, differentiation has to come from proprietary data sources, the quality of the analytical layer, or the speed and reliability of delivery. The perennial alternative data question – does the signal get competed away when everyone has the same tools? – takes on a new form.

For data products that use AI, trust and transparency go hand in hand. The panel discussion identified three things that build credibility with institutional clients: verifiable track records based on live predictions rather than backtests; full source citations that trace every claim back to a specific section of a specific filing; and honesty about which parts of the pipeline are deterministic and which rely on probabilistic AI. That last point matters more than ever. If LLMs fabricate comparative claims and invent trends – as Fin-RATE documents – then clients evaluating AI-powered data products need to know exactly where in the process those models are being used, and what validation sits around them.
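
What a “full source citation” could look like in practice is sketched below. The field names are assumptions rather than any standard, but the principle is that every claim carries a pointer back to a specific span of a specific filing, plus a flag recording whether an LLM was involved in producing it.

```python
# Sketch of a citation-carrying claim; field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    cik: str            # SEC Central Index Key of the issuer
    accession_no: str   # identifier of the filing
    section: str        # e.g. "Item 7. Management's Discussion and Analysis"
    char_start: int     # character offsets of the supporting passage
    char_end: int

@dataclass
class Claim:
    text: str
    citations: list[Citation]
    llm_generated: bool   # True if a language model produced any part of the claim

def is_auditable(claim: Claim) -> bool:
    """A claim with no citations cannot be traced back and should be flagged."""
    return len(claim.citations) > 0
```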

What This Means in Practice

For those evaluating or building AI-powered data products, several conclusions emerge. Single-document LLM extraction is meaningful but imperfect; multi-document synthesis – comparison and longitudinal tracking – is where models fail systematically. If a vendor claims AI-powered comparative analysis, the right questions concern validation methodology, hallucination rates, and entity alignment.

Retrieval architecture may matter more than model selection for production systems. How documents are indexed, bucketed, and queried has a larger impact on accuracy than which LLM generates the answer.
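
One way to test that claim is to measure evidence recall before looking at answer quality at all; the sketch below assumes each benchmark question lists the filings it depends on, which is an assumption about the evaluation setup rather than a documented feature of any benchmark.

```python
# Sketch of an evidence-recall check: before blaming the model, measure how
# often the retriever surfaces the filings a question actually depends on.
def evidence_recall(retrieved_ids: list[str], required_ids: list[str]) -> float:
    """Fraction of required evidence documents present in the retrieved set."""
    if not required_ids:
        return 1.0
    hits = sum(1 for doc_id in required_ids if doc_id in retrieved_ids)
    return hits / len(required_ids)

# If this number is low, swapping in a stronger LLM will not fix the answers.
print(evidence_recall(["FILING-A-2023", "FILING-A-2022"],
                      ["FILING-A-2023", "FILING-B-2023"]))  # prints 0.5
```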

Data quality work remains the critical enabler. AI can help, but it requires deliberate investment and doesn’t happen by default. Organisations that skip the plumbing build on unstable foundations.

And the deterministic-vs-probabilistic architectural choice is commercially significant. Products built on deterministic pipelines offer auditability and consistency; those on LLMs offer flexibility and breadth. Understanding where each applies is essential for informed procurement.

