A-Team Insight Blogs

Scraping at Scale: Where AI Actually Helps, and Where It Doesn’t Yet

29 May 2026

“AI web scraping that doesn’t break” was the title given to a fireside chat at the recent A-Team/Eagle Alpha Alternative Data Conference London – a phrasing that is, on its own terms, aspirational. Pipelines that self-maintain through site changes, schema drift and content shifts remain a destination rather than a current reality, and the session was candid about the distance still to be travelled.

Nico Smuts, Investment and Data Science Leader and formerly of Citadel, was in conversation with Jahmal Nicholson, Senior Data Scientist and Discretionary Data Product Lead at Man Group. Between them they mapped where AI is genuinely changing the operating model for institutional scraping, and where the vendor narrative has run ahead of practice.

AI is widely credited with democratising scraping: tools once reserved for engineering teams are now usable by non-coders, which opens up the field. The picture that emerged from the conversation was more layered. Wider access at the discovery end is producing centralisation at the production end – two sides of the same shift, not opposing trends.

Self-service at the front, centralisation at the back

The conversation turned early to how the development and maintenance burden is shifting inside a large alternative investment manager. The pattern described will be familiar. Discretionary PMs and analysts who once had to queue for engineering resource now run their own initial scrapes, inspect the data and judge whether it is useful. The work transfers to the central data team only when a PM wants the scrape maintained on an ongoing basis.

The structural consequence cuts against the democratisation framing. Legacy scrapes that historically sat inside PM pods or business-unit engineering teams are being pulled back to central ownership. The logic is operational: when a PM-owned scrape breaks, it lands on the central team anyway, so it pays to standardise tooling, monitoring and schemas from the outset. Work is in progress on internal “skills and commands” to standardise scrape setup itself – schema, extraction logic, basic plumbing – freeing central capacity to source new scrapes rather than nurse old ones.

For institutional data teams, this is a more accurate read of what AI is doing to the operating model. The org chart is not flattening; the line between exploratory and production work is being redrawn, with more of the latter pushed into a governed central function.

Detection is not healing

A second area where the session was usefully precise concerned what AI actually does in the maintenance loop. The most spectacular scrape failures – where a site’s structure changes and the pipeline goes dark – are also, paradoxically, the most tractable. AI now reconciles collected data against the live site, locates where the information has moved, and flags the discrepancy. One recent failure of this kind was caught and rectified through a combination of AI-driven detection and an existing data quality framework.

That is a meaningful capability, but it is monitoring and reconciliation, not self-healing. Continuous AI-driven self-healing of pipelines remains exploratory. The setup and detection layer has matured; autonomous remediation has not.

The distinction matters because vendor narratives routinely collapse the two. For institutional data teams evaluating AI-augmented scraping tools, the practical question is whether a product closes the loop or whether it intelligently flags a human to close it. On the evidence here, the latter is the current state of the art inside a serious institutional user.
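To make the split between flagging and fixing concrete, here is a minimal sketch of a detection-only check, assuming a batch of scraped records arrives as a list of dictionaries. The field names, the 90% fill-rate threshold and the `check_batch` function are illustrative assumptions, not anything described in the session. The check reports which required fields have gone dark and stops there; it makes no attempt to repair the extraction logic.

```python
from dataclasses import dataclass, field

# Hypothetical schema: these field names are illustrative only.
REQUIRED_FIELDS = ("product_id", "price", "currency")


@dataclass
class ScrapeCheck:
    ok: bool
    missing: list = field(default_factory=list)
    fill_rates: dict = field(default_factory=dict)


def check_batch(records, required=REQUIRED_FIELDS, min_fill_rate=0.9):
    """Flag required fields whose fill rate falls below a threshold.

    This is detection only: the result is meant to be routed to a human
    (or a ticket queue), not used to rewrite selectors automatically.
    """
    total = len(records)
    fill_rates = {}
    for name in required:
        # A value counts as filled if it is present and not None or "".
        filled = sum(1 for r in records if r.get(name) not in (None, ""))
        # An empty batch is treated as a fully dark pipeline.
        fill_rates[name] = filled / total if total else 0.0
    missing = [n for n in required if fill_rates[n] < min_fill_rate]
    return ScrapeCheck(ok=not missing, missing=missing, fill_rates=fill_rates)
```

A product that closes the loop would need a further, trusted step after `check_batch` returns `ok=False`; in this sketch that step is deliberately left to a person.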

The semantic-failure gap

The sharpest unresolved question concerned a different category of failure: the case where the data keeps flowing, the website structure is unchanged, but the underlying signal has shifted. The content looks the same; the meaning does not.

The discussion reached for derived data quality frameworks – monitoring the quality of signals derived from raw data, in addition to the raw data itself. That is the correct architectural answer, and it is more than many practitioners do. But it speaks to the infrastructure of detection more than the methodology. How an institutional data team systematically identifies that a stable-looking input has changed meaning is a harder problem, and one the industry has yet to settle.

It is the question worth carrying forward. Semantic drift is the failure mode most likely to corrupt a signal silently, and the most resistant to the kind of structural change-detection AI now handles well. It is also where the gap between vendor capability claims and operational practice is widest.
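As an illustration of what monitoring a derived signal can look like, the sketch below compares a recent window of derived values against a baseline window and scores the shift in the mean. This is one simple statistical check chosen for illustration, not the framework discussed in the session, and the function names and the threshold of 4.0 are assumptions. It also shows the limit of the approach: a check like this only fires when the numbers move, so a change in meaning that leaves the distribution intact would pass unnoticed.

```python
import statistics


def mean_shift_score(baseline, recent):
    """Shift of the recent mean, measured in baseline standard errors."""
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least two baseline points and one recent point")
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    recent_mu = statistics.fmean(recent)
    if sigma == 0:
        # A constant baseline: any movement at all is an unbounded shift.
        return 0.0 if recent_mu == mu else float("inf")
    stderr = sigma / len(recent) ** 0.5
    return abs(recent_mu - mu) / stderr


def derived_signal_drifted(baseline, recent, threshold=4.0):
    """Return True when the derived signal has moved enough to warrant review.

    The threshold is an illustrative assumption and would need tuning
    against the history of each signal.
    """
    return mean_shift_score(baseline, recent) > threshold
```

The inputs here are values of the derived signal, not raw scraped fields, which is the architectural point made in the discussion: the raw feed can pass every structural check while the quantity built on top of it has moved.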

Compliance as the binding constraint

A final thread concerned the so-called arms race between scrapers and data owners. The framing is everywhere in the broader discourse: AI lowers the cost of circumventing anti-scraping measures, AI also lowers the cost of detecting automated scraping, and both sides accelerate.

Inside an institutional manager, the binding constraint is internal policy, not technical capability. Strict rules govern what can be scraped in the first place. Sites that sit behind a login, or that impose captchas, trigger a compliance and commercial review before any scraping is attempted. AI’s institutional use case here was characterised as defensive rather than offensive – automatic detection and anonymisation of personally identifiable information (PII) at ingest, alerts for compliance review – not as a means of defeating site protections.

That is a structural point the arms-race framing tends to obscure. For institutional data teams, technical possibility sets the outer bound; the operative question is what falls within a governance perimeter the firm is willing to operate inside.
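The defensive use case can be sketched in a few lines. The example below, which assumes scraped text arrives as plain strings, redacts email addresses and phone-like numbers at ingest and returns a list of findings that could feed a compliance alert. The regular expressions and the `redact_pii` function are illustrative assumptions and deliberately crude: they will over-match (a hyphenated date can look like a phone number) and will miss names, addresses and other identifiers, so this is not a substitute for a reviewed PII policy or the AI-based detection the session described.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Return (redacted_text, findings) for one scraped text field.

    findings lists the label of every redaction made, in the order the
    patterns are applied, so it can be logged for compliance review.
    """
    findings = []

    def _replacer(label):
        def repl(match):
            findings.append(label)
            return f"[{label}]"
        return repl

    # Emails are handled first so digits inside an address are not
    # mistaken for a phone number.
    text = EMAIL_RE.sub(_replacer("EMAIL"), text)
    text = PHONE_RE.sub(_replacer("PHONE"), text)
    return text, findings
```

Returning the findings alongside the redacted text keeps the two halves of the described use case together: anonymisation happens before storage, and the alert gives compliance something to review.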

Use-case stratification

One concrete deployment observation is worth highlighting. For discretionary equity strategies, web-scrape data continues to function as one input among many – decision-support rather than signal. On the macro and fixed income side, the pattern looks different: scrape data is becoming more integral to thesis monitoring, with actionable alerts triggered when the conditions underpinning a position no longer hold. That is a more demanding role for the data, and a more demanding standard for the pipeline that delivers it. It is also where the gap between detection and self-healing matters most.
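Thesis monitoring of this kind can be pictured as a set of named conditions evaluated against the latest scraped metrics. The sketch below is a hypothetical illustration: the `Condition` class, the metric names and the thresholds are all assumptions, not details from the session. One design choice is worth noting: a metric that fails to arrive is treated as a broken condition, because a silent gap in the feed is exactly what a position-monitoring pipeline cannot afford.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Condition:
    """A named test that must hold for the thesis to remain intact."""
    name: str
    holds: Callable[[Mapping[str, float]], bool]


def broken_conditions(conditions, metrics):
    """Return the names of conditions that no longer hold.

    A missing metric counts as a failure, so a stalled scrape raises
    an alert instead of passing silently.
    """
    alerts = []
    for condition in conditions:
        try:
            ok = condition.holds(metrics)
        except KeyError:
            ok = False
        if not ok:
            alerts.append(condition.name)
    return alerts


# Hypothetical thesis: metric names and thresholds are illustrative only.
EXAMPLE_CONDITIONS = [
    Condition("discount_depth_below_20pct", lambda m: m["avg_discount_pct"] < 20.0),
    Condition("listings_growing", lambda m: m["listing_count_wow_pct"] > 0.0),
]
```

Calling `broken_conditions(EXAMPLE_CONDITIONS, latest_metrics)` on each refresh would yield an empty list while the thesis holds and the names of the failed conditions when it does not.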
