
Why Data Quality Isn’t Always What the Textbook Says It Is

The standard taxonomy for data quality – accuracy, completeness, consistency, timeliness, point-in-time integrity, lineage – is easy to find in any vendor deck or industry guide. It is also increasingly beside the point. What the institutional buy side actually measures, and what it cares about when money is on the line, diverges sharply from that tidy checklist, and the divergence has real consequences for how data is bought, monitored, and deployed, according to a panel at the recent A-Team/Eagle Alpha Alternative Data Conference in New York.

The session brought together Mike Soss, Chief Investment Officer at Millburn; Enrico Dallavecchia, President and COO at Arctium Capital Management; Samantha Mait, Data Science Operations Lead at Balyasny Asset Management; and Isabel Ahsler, Team Manager, Strategic Accounts at Sensor Tower, and was moderated by Brendan Furlong, Chief Data Advisory Officer at Eagle Alpha. What emerged was a consistent argument from three different institutional vantage points: data quality is not a property of a dataset at the point of delivery. It is a property of a workflow at the point of use. And the gap between those two definitions is where most of the operational pain in institutional data programmes actually lives.

The cultural chasm between data teams and the front office

The most pointed contribution of the session came from the buy-side systematic perspective. The textbook priority ordering – accuracy first, then timeliness – was flatly rejected in favour of a pragmatic inversion. As one panellist put it, for a front-office P&L seat, timely and slightly wrong beats 100 percent accurate and late. Known inconsistencies are manageable. A data pipeline that halts for manual intervention every time an oil price moves by an unusual amount is not.

That inversion points to something deeper than a philosophical disagreement. A panellist described a structural misalignment of incentives between data teams and the investment desks they serve: if a data error costs the desk a million dollars, the data team is held accountable and the consequences are loud; if the desk makes a hundred million dollars on timely data, little of that reward flows back to the team that delivered it. The result is a defensive operational culture in which data is held up at the first sign of anomaly because the downside of letting a bad point through is concentrated and visible, while the upside of letting a good point through is diffuse and unrewarded.

The worked example offered was the Coach-Macy’s re-labelling event of 2019, where a retail distribution deal caused roughly eight percent of Coach’s credit card transactions to be reclassified. Quantitative strategies reading the data at face value shorted the name aggressively, only for Coach to report a stronger quarter than expected. The data was not wrong. The framework for interpreting it was. Any automated quality check would have passed the data through cleanly, because the dataset itself was internally consistent. The failure was in the layer above.

Quality as a property of understanding, not the dataset

One panellist, speaking from an allocator perspective, pushed the argument further. The most important quality attribute, in this view, is robustness – defined not as a dataset characteristic but as whether the user genuinely understands what the data represents and what it can support as an input to a decision. The speaker described encountering portfolio managers in due diligence meetings who would explain that their data team hands them analysis and they overlay their investment judgement on top. Those conversations, the panellist noted, tended to end within five minutes. A portfolio manager who has delegated understanding of the underlying data to someone else has, in effect, delegated the investment decision.

The same logic extends to vendor relationships. Trust in a vendor was treated with suspicion rather than satisfaction. If the answer to how data is validated is that the vendor handles it, the buyer has outsourced the responsibility to understand what they are actually using. The panellist described situations in which large institutions had purchased data for four or five years without anyone reviewing the quality checks performed at the original onboarding. The data had not degraded in any obvious way. But no one could have said so with confidence, because no one had looked.

Proprietary methodology for backfilling missing data points was flagged as a particular red line. A vendor unwilling to disclose how missing values are filled is, in the panellist’s framing, asking the buyer to take the core interpretive act of the dataset on faith. For a large institutional buyer, that is not a sustainable basis for a data relationship, regardless of how strong the headline numbers look.

The granularity problem

From a multi-manager operational perspective, another panellist identified a failure mode specific to scale. Large firms typically operate a tiered quality-check regime: basic ETL-level validation (row counts, nulls, duplicates) performed by data onboarding teams, and higher-level data science checks (year-on-year spend changes across a coverage universe, standard deviation bands on key metrics) performed by research functions. Both are necessary. Neither is sufficient.
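
A minimal sketch of what those two tiers might look like in practice is below, using pandas. The column names (ticker, period, spend), the monthly cadence, and the thresholds are illustrative assumptions, not any panellist’s actual implementation.

```python
import pandas as pd


def etl_checks(df: pd.DataFrame, expected_min_rows: int = 1_000) -> dict:
    """Tier 1: mechanical onboarding validation - row counts, nulls, duplicates."""
    return {
        "row_count_ok": len(df) >= expected_min_rows,
        "null_fraction_by_column": df.isna().mean().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def research_checks(df: pd.DataFrame, sigma: float = 3.0) -> dict:
    """Tier 2: research-level validation - year-on-year spend changes per ticker,
    flagged when they sit outside a standard-deviation band."""
    spend = df.pivot_table(index="period", columns="ticker",
                           values="spend", aggfunc="sum")
    yoy = spend.pct_change(periods=12)          # assumes monthly periods
    z_scores = (yoy - yoy.mean()) / yoy.std()
    outliers = (z_scores.abs() > sigma).sum()
    return {"tickers_with_yoy_outliers": outliers[outliers > 0].to_dict()}
```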

The gap, it was argued, is that firm-wide thresholds are necessarily set to tolerate a degree of noise – otherwise, false-positive alerts would overwhelm the onboarding function. But a firm-wide threshold that tolerates one missing ticker out of a thousand in a dataset will not flag the loss of the single ticker that one pod happens to be trading that day. That team will be blindsided by a data integrity failure that the firm’s central QC regime correctly classified as below threshold. The implication is that investment teams themselves need to own a QC layer specific to their exposures, alongside explicit procedural workarounds for when data is late or compromised. Data quality, at the level of actual investment decisions, cannot be fully centralised.
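
As a hedged illustration of that point, the fragment below contrasts a firm-wide coverage threshold with a pod-level check. The tickers, the 0.1 percent tolerance, and the function names are hypothetical: the firm-wide check passes exactly because the loss is below threshold, and only the pod’s own check surfaces it.

```python
def firm_wide_coverage_ok(expected: set[str], delivered: set[str],
                          max_missing_frac: float = 0.001) -> bool:
    """Central QC: pass the file if no more than 0.1% of expected tickers are missing."""
    missing = expected - delivered
    return len(missing) / len(expected) <= max_missing_frac


def pod_coverage_alerts(pod_book: set[str], delivered: set[str]) -> set[str]:
    """Pod-level QC: flag any name the desk actually trades that is absent today."""
    return pod_book - delivered


expected_universe = {f"TICK{i:04d}" for i in range(1_000)}
delivered_today = expected_universe - {"TICK0042"}        # one name silently dropped

print(firm_wide_coverage_ok(expected_universe, delivered_today))       # True - below threshold
print(pod_coverage_alerts({"TICK0042", "TICK0007"}, delivered_today))  # {'TICK0042'}
```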

Procurement and governance responses

The framing that quality is constituted at the point of use has direct implications for how data is contracted and monitored. Several practical disciplines surfaced during the discussion. A 90-day opt-out clause on a standard annual licence was cited as a defence against the recurring problem of production data diverging materially from trial data once payment has been made.

An annual refresh of quality validation on long-running vendor relationships was presented as a discipline that large shops in particular tend to neglect, on the assumption that data that was clean when onboarded has remained clean. And cross-dataset consistency checks – sanity tests between related metrics that should move in plausible relation to each other – were highlighted, from the vendor side, as a category of automated check that catches errors standard row-count monitoring does not.
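
A sketch of the kind of cross-metric sanity test being described might look like the following. The paired metrics (transaction counts and total spend), the rolling window, and the 25 percent tolerance are assumptions chosen for illustration.

```python
import pandas as pd


def consistency_flags(transactions: pd.Series, spend: pd.Series,
                      tolerance: float = 0.25) -> pd.Series:
    """Flag periods where the implied spend-per-transaction drifts more than
    `tolerance` away from its trailing median - a relationship that should
    stay broadly stable even when both inputs move."""
    avg_ticket = spend / transactions
    baseline = avg_ticket.rolling(window=12, min_periods=6).median()
    deviation = (avg_ticket - baseline).abs() / baseline
    return deviation > tolerance
```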

The agent question

Closing the session, the panel was asked whether a well-trained AI agent could now handle the data quality workflow end to end. The consensus was that agents can plausibly handle the mechanical layer – row counts, null checks, duplicate detection – but that the interpretive layer remains beyond automation. The Coach-Macy’s example was offered as an illustration. An agent reviewing that dataset would find nothing wrong. The data was structurally clean. What was broken was the relationship between the data and the world it was meant to describe, and that relationship is precisely what the panel had spent the preceding hour arguing cannot be assessed from inside the dataset alone.

The warning was that the more data quality monitoring is automated, the more uniformly trusted its outputs become, and the more damaging the errors that slip through. Quality, on the panel’s reading, is ultimately a human discipline of understanding – and the firms that treat it as anything else are the ones most exposed when the understanding fails.
