
The alternative data industry has largely settled the question of whether synthetic data has a place in institutional workflows. The harder question – and the one that dominated a lively panel discussion at the A-Team/Eagle Alpha Alternative Data Conference in New York – is where exactly that place is, and how far practitioners should trust what synthetic data tells them.
The panel, moderated by Jeffrey Maron, Managing Director at 7RIDGE, brought together Sarkis Agaian, Senior Quantitative Developer at Laurion Capital Management; Iro Tasitsiomi, Head of AI and Investments Data Science at T. Rowe Price; and Yifang Cao, Senior Quant Researcher at Jupiter Research Capital.
What emerged was less a debate about whether to use synthetic data than a nuanced argument about matching generation methodology to the type of uncertainty being addressed, along with a shared conviction that synthetic data’s most defensible role is in building confidence rather than discovering alpha.
Three buckets of uncertainty
One panellist offered a framework that effectively structured the rest of the discussion, organising synthetic data use cases into three tiers defined by the nature of what you don’t know.
The first tier covers situations where the rules are understood and the task is to test infrastructure against them. Traditional Monte Carlo simulations sit here, in that the user programmes the rules, generates data accordingly, and uses the output to validate pipelines, check data quality, or stress-test known parameters. This is the most established and least controversial use of synthetic data.
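To make the first tier concrete, here is a minimal sketch of the idea; the parameters and checks are illustrative assumptions, not anything cited on the panel. The rules are programmed up front, data is generated under them, and the output is used to confirm that a downstream pipeline reproduces what was programmed.

```python
import numpy as np

# Tier one: the rules are known and programmed explicitly. Simulate returns
# under those rules, then check that a downstream pipeline recovers the
# programmed parameters. All numbers here are illustrative assumptions.

rng = np.random.default_rng(42)

MU, SIGMA = 0.0004, 0.01          # programmed daily drift and volatility
N_PATHS, N_DAYS = 10_000, 252

returns = rng.normal(MU, SIGMA, size=(N_PATHS, N_DAYS))
prices = 100.0 * np.exp(np.cumsum(returns, axis=1))

def pipeline_summary(price_paths):
    """Stand-in for the internal analytics pipeline being validated."""
    log_ret = np.diff(np.log(price_paths), axis=1)
    return {"mean": log_ret.mean(), "vol": log_ret.std()}

summary = pipeline_summary(prices)

# Validation: estimates should sit within Monte Carlo error of the programmed
# values; a material gap points at a pipeline bug, not at the market.
assert abs(summary["mean"] - MU) < 1e-4
assert abs(summary["vol"] - SIGMA) < 1e-3
print(summary)
```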
The second tier addresses situations where the underlying patterns exist in data but haven’t been explicitly defined. A recent paper from a major investment bank describing a large language model trained to generate synthetic trades was cited as an example: rather than encoding order book rules manually, the model ingests historical trade data and derives its own patterns, producing synthetic trades that reflect learned rather than programmed behaviour. The analogy offered was that what ChatGPT does for predicting the next word, this class of model does for predicting the next trade.
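The cited paper was not described in architectural detail, so the sketch below is a deliberately crude stand-in for the idea rather than the bank’s approach: a first-order model that learns next-trade-state frequencies from historical data instead of having order-book rules coded by hand. The bucketing scheme and the data are invented for illustration.

```python
import numpy as np
from collections import defaultdict

# Crude stand-in for the "next trade" idea: instead of hand-coding order-book
# rules, learn transition probabilities between coarse trade states from
# historical data, then sample synthetic trades from what was learned.

rng = np.random.default_rng(0)

def discretise(sizes):
    """Map signed trade sizes to coarse states such as 'large_buy' (assumed scheme)."""
    return [f"{'large' if abs(s) > 500 else 'small'}_{'buy' if s > 0 else 'sell'}"
            for s in sizes]

def fit_transitions(states):
    """Count empirical next-state frequencies - the 'learned' behaviour."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(states, states[1:]):
        counts[prev][nxt] += 1
    return {prev: {s: c / sum(nxts.values()) for s, c in nxts.items()}
            for prev, nxts in counts.items()}

def generate(transitions, start, n):
    """Sample a synthetic sequence of trade states from the learned probabilities."""
    out, state = [], start
    for _ in range(n):
        choices = list(transitions[state])
        state = rng.choice(choices, p=[transitions[state][s] for s in choices])
        out.append(state)
    return out

historical_sizes = rng.normal(0, 400, size=5_000)   # illustrative trade sizes
states = discretise(historical_sizes)
print(generate(fit_transitions(states), states[0], 10))
```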
The third tier – and the one that generated the most debate – concerns situations where historical data provides little or no guide. Black swan events, geopolitical regime changes, and unprecedented market dislocations fall here. One panellist described an emerging approach using agent-based simulation, where AI-generated personas with defined characteristics interact within a simulated environment to produce probabilistic outcomes for scenarios that have never occurred. Not everyone on the panel was convinced. One view held that unknown unknowns are by definition beyond reach. You can worry about them, but there’s not much you can do. The counterargument was that this category represents precisely the kind of problem where synthetic data’s contribution could be most distinctive.
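No specific framework was named for this approach. The toy sketch below illustrates the general idea, with the personas, sensitivities and price impact all invented for illustration: agents defined by a few attributes react to an unprecedented shock, and repeated runs yield a distribution of outcomes rather than a single forecast.

```python
import numpy as np

# Toy sketch of agent-based scenario generation (no specific framework was
# named on the panel). Personas with assumed attributes react to a shock
# with no historical precedent; repeated runs give a probabilistic outcome
# distribution rather than a point forecast.

rng = np.random.default_rng(1)

PERSONAS = [  # illustrative personas, not calibrated to anything real
    {"name": "momentum_fund", "risk_aversion": 0.2, "shock_sensitivity": 1.5},
    {"name": "pension",       "risk_aversion": 0.8, "shock_sensitivity": 0.4},
    {"name": "retail",        "risk_aversion": 0.5, "shock_sensitivity": 1.0},
]

def run_scenario(shock_size, n_steps=50):
    """One simulated path: agents trade in response to an unprecedented shock."""
    price = 100.0
    for _ in range(n_steps):
        net_demand = 0.0
        for p in PERSONAS:
            reaction = -p["shock_sensitivity"] * shock_size
            noise = rng.normal(0, 1 - p["risk_aversion"])
            net_demand += reaction + noise
        price *= 1 + 0.001 * net_demand   # simple linear price impact
    return price

# Many runs of the same hypothetical scenario yield a distribution of outcomes.
outcomes = np.array([run_scenario(shock_size=2.0) for _ in range(2_000)])
print(f"median terminal price {np.median(outcomes):.1f}, "
      f"5th percentile {np.percentile(outcomes, 5):.1f}")
```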
The bias question proves unexpectedly contentious
When polled, more than half the audience identified bias as their biggest concern about synthetic data. The response drew a sharp reaction from the panel. One panellist described being almost annoyed by the result, pointing out that bias is pervasive in real-world data – from how a trading day is defined across time zones to how exchange outtrades are handled – and questioning what makes bias in synthetic data categorically different.
The exchange that followed was among the most revealing of the session. An audience member pushed back, arguing that the issue isn’t whether bias exists but who is hiding it. Another attendee framed the problem in more practical terms: working with senior housing demand data where historical records are thin, they had generated synthetic data to fill the gap but had no framework for evaluating whether the output was trustworthy.
The panel’s response reframed the problem. One panellist argued that rather than asking whether synthetic data is biased, the better question is whether the ontology – the definition of who and what belongs in the simulation – is biased. In agent-based approaches, this means scrutinising the knowledge graph of entities and relationships that underpins the simulation, not just the data it produces. Another panellist reinforced that if real-world training data contains bias, synthetic data derived from it will inevitably reproduce it, and that privacy-preserving techniques require their own rigorous validation, from membership inference testing to reidentification checks.
Model collapse and the tail problem
The discussion surfaced a more technically specific risk that the panel treated as underappreciated: tail distortion and the related concept of model collapse. As one panellist explained, generative models trained on synthetic or model-produced data tend to undersample rare events, because extreme tail observations are by definition unlikely to appear proportionally in training sets. Over successive generations, this creates a compounding loss of distributional variety, or what the panellist described in vivid terms as a kind of thermal death where the range of outputs narrows progressively toward the centre.
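The mechanics are easy to reproduce in miniature. The toy setup below is an assumption for illustration rather than anything shown on the panel: each generation fits a simple model to the previous generation’s synthetic output and then samples from that fit. The very first fit already discards the heavy tails of the original data, and with finite samples per generation the spread keeps drifting toward the centre.

```python
import numpy as np

# Toy illustration of model collapse (assumed setup): each "generation" fits
# a Gaussian to the previous generation's synthetic output and samples from
# the fit. Heavy tails are lost at the first fit, and the spread then drifts
# progressively toward the centre over successive generations.

rng = np.random.default_rng(7)

N_PER_GEN, N_GENERATIONS = 50, 500

data = rng.standard_t(df=3, size=N_PER_GEN)     # heavy-tailed "real" data

spreads = []
for _ in range(N_GENERATIONS):
    mu, sigma = data.mean(), data.std()          # refit on current (synthetic) data
    data = rng.normal(mu, sigma, size=N_PER_GEN) # next generation is purely synthetic
    spreads.append(data.std())

print(f"std after generation   1: {spreads[0]:.3f}")
print(f"std after generation 100: {spreads[99]:.3f}")
print(f"std after generation 500: {spreads[-1]:.3f}")
```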
For financial applications, where tail events carry outsized economic significance, this represents a fundamental limitation. The panellist noted that mathematical tools exist to address the problem – change of measure techniques and related approaches – but questioned whether practitioners are routinely applying them. A counterpoint emerged from the agent-based simulation perspective: if the goal is to generate scenarios that have never been observed, the process is inherently diversifying rather than converging, potentially working in the opposite direction to model collapse.
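Importance sampling is one standard example of such a change of measure. The sketch below uses assumed distributions purely for illustration: rather than waiting for naive simulation to land in the tail, it samples from a distribution shifted into the tail and reweights by the likelihood ratio so the estimate remains valid under the original measure.

```python
import numpy as np
from scipy import stats

# Change of measure via importance sampling (a standard technique, sketched
# with assumed distributions): estimate P(loss > threshold) for a rare event
# by drawing from a shifted proposal and reweighting by the likelihood ratio.

rng = np.random.default_rng(3)

THRESHOLD = 4.5          # a roughly 3-in-a-million event under the nominal model
N = 100_000

# Nominal model: loss ~ N(0, 1). Naive Monte Carlo rarely hits the tail.
naive = rng.normal(0, 1, N)
p_naive = np.mean(naive > THRESHOLD)

# Proposal shifted into the tail; the weights keep the estimate unbiased
# under the nominal measure.
proposal = rng.normal(THRESHOLD, 1, N)
weights = stats.norm.pdf(proposal, 0, 1) / stats.norm.pdf(proposal, THRESHOLD, 1)
p_is = np.mean((proposal > THRESHOLD) * weights)

print(f"true tail probability : {stats.norm.sf(THRESHOLD):.2e}")
print(f"naive estimate        : {p_naive:.2e}")
print(f"importance sampling   : {p_is:.2e}")
```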
Trust, not alpha
The strongest point of consensus concerned what synthetic data should and should not be expected to deliver. One panellist was emphatic: the criterion for judging synthetic data from a research perspective is whether it helps establish whether a strategy can be broken, not whether it generates alpha. If adding synthetic data makes a strategy look better, that should reduce rather than increase confidence. If it makes a strategy look worse, you wouldn’t trust it either. The value sits in between, in the robustness and sensitivity testing that builds conviction in a strategy’s durability.
This framing resonated across the panel. Synthetic data’s contribution, the panellist argued, is additive to trust and confidence, not a substitute for signal derived from real-world observation.
Evaluating vendors
On the question of whether to build synthetic data capabilities internally or source them from vendors, the panel offered practical guidance. One panellist outlined two scenarios where an external vendor might earn trust: either they have access to a proprietary alternative data source that they use synthetic data to anonymise and distribute, or they have genuinely superior modelling expertise that produces outputs closer to reality than an internal team could achieve. In either case, the panellist was clear that surface-level performance metrics are insufficient. A vendor needs to demonstrate the fundamentals and mechanisms underlying their synthetic data, not simply present favourable numbers.
The broader principle applies equally to internal efforts. As one panellist put it, you have to purposefully control your recipe, by understanding the parameters, the distributional assumptions, and the limitations of whatever generation methodology is employed. Blindly trusting synthetic data, whether from a vendor, an internal team, or your own earlier work, was treated as the single biggest trap the panel could identify.
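One low-tech way to make that concrete, sketched below as an assumed structure rather than any standard, is to require every synthetic dataset to travel with an explicit, reviewable description of how it was produced: the parameters, the distributional assumptions and the known limitations.

```python
from dataclasses import dataclass

# An assumed (not standard) structure for making a synthetic data "recipe"
# explicit and reviewable: the generation parameters, the distributional
# assumptions and the known limitations travel with the dataset.

@dataclass(frozen=True)
class SyntheticDataRecipe:
    method: str                           # e.g. "parametric Monte Carlo", "agent-based"
    source_data: str                      # what real data, if any, the generator saw
    parameters: dict                      # every tunable knob, spelled out
    distributional_assumptions: list
    known_limitations: list
    validation_performed: list

recipe = SyntheticDataRecipe(
    method="parametric Monte Carlo",
    source_data="none - rules programmed directly",
    parameters={"drift": 0.0004, "vol": 0.01, "horizon_days": 252},
    distributional_assumptions=["i.i.d. Gaussian daily returns", "constant volatility"],
    known_limitations=["no fat tails", "no volatility clustering", "no jumps"],
    validation_performed=["moment checks against programmed parameters"],
)
print(recipe)
```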
Validation in practice
The panel touched briefly on what rigorous validation looks like, though time constraints limited the depth. The checklist offered included: univariate statistical validation across moments and full distributions; multivariate checks on joint distributions and conditional relationships; cross-validation using real-world versus synthetic training and test sets; and, for generative approaches, memorisation checks to ensure the model hasn’t simply reproduced its training data. On the privacy side, membership inference testing – whether an algorithm can determine with better than chance probability that a specific record was used in training – was highlighted as a minimum standard.
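As a compressed illustration of a few items on that checklist, the sketch below uses assumed data purely for demonstration: marginal distribution tests, a joint-structure comparison, and a naive memorisation check based on nearest-neighbour distances.

```python
import numpy as np
from scipy import stats
from scipy.spatial import cKDTree

# Compressed sketch of a few checklist items (assumed data shapes; real
# validation would be far more extensive): univariate distribution tests,
# a joint-structure comparison, and a naive memorisation check.

rng = np.random.default_rng(11)
real = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=5_000)
synthetic = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=5_000)

# 1. Univariate: compare moments and full distributions column by column.
for col in range(real.shape[1]):
    ks = stats.ks_2samp(real[:, col], synthetic[:, col])
    mean_gap = real[:, col].mean() - synthetic[:, col].mean()
    print(f"col {col}: mean gap {mean_gap:+.3f}, KS p-value {ks.pvalue:.3f}")

# 2. Multivariate: does the synthetic data preserve the joint structure?
corr_gap = np.abs(np.corrcoef(real.T) - np.corrcoef(synthetic.T)).max()
print(f"max correlation gap: {corr_gap:.3f}")

# 3. Memorisation: flag synthetic rows that sit suspiciously close to a real row.
nearest, _ = cKDTree(real).query(synthetic)
print(f"share of synthetic rows within 1e-6 of a real row: {(nearest < 1e-6).mean():.4f}")
```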
The session closed on the theme of trust and confidence, neatly capturing the panel’s central thesis. Synthetic data is neither silver bullet nor snake oil. Its value depends entirely on whether practitioners understand what it can and cannot do, and whether they impose the same rigour on generated data that they would demand of any other input to their investment process.