
Why Data Quality Isn’t Always What the Textbook Says It Is

The standard taxonomy for data quality – accuracy, completeness, consistency, timeliness, point-in-time integrity, lineage – is easy to find in any vendor deck or industry guide. It is also increasingly beside the point. What the institutional buy side actually measures, and what it cares about when money is on the line, diverges sharply from that tidy checklist, and the divergence has real consequences for how data is bought, monitored, and deployed, according to a panel at the recent A-Team/Eagle Alpha Alternative Data Conference in New York.

The session brought together Mike Soss, Chief Investment Officer at Millburn; Enrico Dallavecchia, President and COO at Arctium Capital Management; Samantha Mait, Data Science Operations Lead at Balyasny Asset Management; and Isabel Ahsler, Team Manager, Strategic Accounts at Sensor Tower, and was moderated by Brendan Furlong, Chief Data Advisory Officer at Eagle Alpha. What emerged was a consistent argument from three different institutional vantage points: data quality is not a property of a dataset at the point of delivery. It is a property of a workflow at the point of use. And the gap between those two definitions is where most of the operational pain in institutional data programmes actually lives.

The cultural chasm between data teams and the front office

The most pointed contribution of the session came from the buy-side systematic perspective. The textbook priority ordering – accuracy first, then timeliness – was flatly rejected in favour of a pragmatic inversion. As one panellist put it, for a front-office P&L seat, timely and slightly wrong beats 100 percent accurate and late. Known inconsistencies are manageable. A data pipeline that halts for manual intervention every time an oil price moves by an unusual amount is not.

That inversion points to something deeper than a philosophical disagreement. A panellist described a structural misalignment of incentives between data teams and the investment desks they serve: if a data error costs the desk a million dollars, the data team is held accountable and the consequences are loud; if the desk makes a hundred million dollars on timely data, little of that reward flows back to the team that delivered it. The result is a defensive operational culture in which data is held up at the first sign of anomaly because the downside of letting a bad point through is concentrated and visible, while the upside of letting a good point through is diffuse and unrewarded.

The worked example offered was the Coach-Macy’s re-labelling event of 2019, where a retail distribution deal caused roughly eight percent of Coach’s credit card transactions to be reclassified. Quantitative strategies reading the data at face value shorted the name aggressively, only for Coach to report a stronger quarter than expected. The data was not wrong. The framework for interpreting it was. Any automated quality check would have passed the data through cleanly, because the dataset itself was internally consistent. The failure was in the layer above.

Quality as a property of understanding, not the dataset

One panellist, speaking from an allocator perspective, pushed the argument further. The most important quality attribute, in this view, is robustness – defined not as a dataset characteristic but as whether the user genuinely understands what the data represents and what it can support as an input to a decision. The speaker described encountering portfolio managers in due diligence meetings who would explain that their data team hands them analysis and they overlay their investment judgement on top. Those conversations, the panellist noted, tended to end within five minutes. A portfolio manager who has delegated understanding of the underlying data to someone else has, in effect, delegated the investment decision.

The same logic extends to vendor relationships. Trust in a vendor was treated with suspicion rather than satisfaction. If the answer to how data is validated is that the vendor handles it, the buyer has outsourced the responsibility to understand what they are actually using. The panellist described situations in which large institutions had purchased data for four or five years without anyone reviewing the quality checks performed at the original onboarding. The data had not degraded in any obvious way. But no one could have said so with confidence, because no one had looked.

Proprietary methodology for backfilling missing data points was flagged as a particular red line. A vendor unwilling to disclose how missing values are filled is, in the panellist’s framing, asking the buyer to take the core interpretive act of the dataset on faith. For a large institutional buyer, that is not a sustainable basis for a data relationship, regardless of how strong the headline numbers look.

The granularity problem

From a multi-manager operational perspective, another panellist identified a failure mode specific to scale. Large firms typically operate a tiered quality-check regime: basic ETL-level validation (row counts, nulls, duplicates) performed by data onboarding teams, and higher-level data science checks (year-on-year spend changes across a coverage universe, standard deviation bands on key metrics) performed by research functions. Both are necessary. Neither is sufficient.
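
A minimal sketch of what those two tiers might look like in practice is below, using pandas. The column names (ticker, period, spend), the monthly cadence, and the thresholds are illustrative assumptions, not any panellist’s actual implementation.

```python
import pandas as pd


def etl_checks(df: pd.DataFrame, expected_min_rows: int = 1_000) -> dict:
    """Tier 1: mechanical onboarding validation - row counts, nulls, duplicates."""
    return {
        "row_count_ok": len(df) >= expected_min_rows,
        "null_fraction_by_column": df.isna().mean().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def research_checks(df: pd.DataFrame, sigma: float = 3.0) -> dict:
    """Tier 2: research-level validation - year-on-year spend changes per ticker,
    flagged when they sit outside a standard-deviation band."""
    spend = df.pivot_table(index="period", columns="ticker",
                           values="spend", aggfunc="sum")
    yoy = spend.pct_change(periods=12)          # assumes monthly periods
    z_scores = (yoy - yoy.mean()) / yoy.std()
    outliers = (z_scores.abs() > sigma).sum()
    return {"tickers_with_yoy_outliers": outliers[outliers > 0].to_dict()}
```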

The gap, it was argued, is that firm-wide thresholds are necessarily set to tolerate a degree of noise – otherwise, false-positive alerts would overwhelm the onboarding function. But a firm-wide threshold that tolerates one missing ticker out of a thousand in a dataset will not flag the loss of the single ticker that one pod happens to be trading that day. That team will be blindsided by a data integrity failure that the firm’s central QC regime correctly classified as below threshold. The implication is that investment teams themselves need to own a QC layer specific to their exposures, alongside explicit procedural workarounds for when data is late or compromised. Data quality, at the level of actual investment decisions, cannot be fully centralised.
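
As a hedged illustration of that point, the fragment below contrasts a firm-wide coverage threshold with a pod-level check. The tickers, the 0.1 percent tolerance, and the function names are hypothetical: the firm-wide check passes exactly because the loss is below threshold, and only the pod’s own check surfaces it.

```python
def firm_wide_coverage_ok(expected: set[str], delivered: set[str],
                          max_missing_frac: float = 0.001) -> bool:
    """Central QC: pass the file if no more than 0.1% of expected tickers are missing."""
    missing = expected - delivered
    return len(missing) / len(expected) <= max_missing_frac


def pod_coverage_alerts(pod_book: set[str], delivered: set[str]) -> set[str]:
    """Pod-level QC: flag any name the desk actually trades that is absent today."""
    return pod_book - delivered


expected_universe = {f"TICK{i:04d}" for i in range(1_000)}
delivered_today = expected_universe - {"TICK0042"}        # one name silently dropped

print(firm_wide_coverage_ok(expected_universe, delivered_today))       # True - below threshold
print(pod_coverage_alerts({"TICK0042", "TICK0007"}, delivered_today))  # {'TICK0042'}
```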

Procurement and governance responses

The framing that quality is constituted at the point of use has direct implications for how data is contracted and monitored. Several practical disciplines surfaced during the discussion. A 90-day opt-out clause on a standard annual licence was cited as a defence against the recurring problem of production data diverging materially from trial data once payment has been made.

An annual refresh of quality validation on long-running vendor relationships was presented as a discipline that large shops in particular tend to neglect, on the assumption that data that was clean when onboarded has remained clean. And cross-dataset consistency checks – sanity tests between related metrics that should move in plausible relation to each other – were highlighted, from the vendor side, as a category of automated check that catches errors standard row-count monitoring does not.
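
A sketch of the kind of cross-metric sanity test being described might look like the following. The paired metrics (transaction counts and total spend), the rolling window, and the 25 percent tolerance are assumptions chosen for illustration.

```python
import pandas as pd


def consistency_flags(transactions: pd.Series, spend: pd.Series,
                      tolerance: float = 0.25) -> pd.Series:
    """Flag periods where the implied spend-per-transaction drifts more than
    `tolerance` away from its trailing median - a relationship that should
    stay broadly stable even when both inputs move."""
    avg_ticket = spend / transactions
    baseline = avg_ticket.rolling(window=12, min_periods=6).median()
    deviation = (avg_ticket - baseline).abs() / baseline
    return deviation > tolerance
```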

The agent question

Closing the session, the panel was asked whether a well-trained AI agent could now handle the data quality workflow end to end. The consensus was that agents can plausibly handle the mechanical layer – row counts, null checks, duplicate detection – but that the interpretive layer remains beyond automation. The Coach-Macy’s example was offered as an illustration. An agent reviewing that dataset would find nothing wrong. The data was structurally clean. What was broken was the relationship between the data and the world it was meant to describe, and that relationship is precisely what the panel had spent the preceding hour arguing cannot be assessed from inside the dataset alone.

The warning was that the more data quality monitoring is automated, the more uniformly trusted its outputs become, and the more damaging the errors that slip through. Quality, on the panel’s reading, is ultimately a human discipline of understanding – and the firms that treat it as anything else are the ones most exposed when the understanding fails.
