
The Gap Technographics Leaves for Investment Firms – And What Comes Next


Try to track the technologies Microsoft uses, and the first thing that becomes clear is how scattered the question really is. Microsoft is microsoft.com, but it is also azure.com, github.com, linkedin.com, xbox.com, office.com, and hundreds of country and product subdomains. None of those addresses carry a built-in label saying "this belongs to Microsoft, ticker MSFT". A crawler hitting azure.com on its own has no idea it is looking at part of Microsoft's digital estate.

Detection is the easy part. Web crawlers can fingerprint what runs on a given page – the CDN, the analytics tags, the payments processor, the CMS – with reasonable reliability. The harder problem sits one layer up: rolling that detection back to the company, and from there to the listed issuer whose securities a fund might actually trade. Without that aggregation, web-derived technology data is interesting at the page level but not actionable at the investment level. An analyst cannot build a thesis around "azure.com uses X". They need "Microsoft uses X".
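
To make that concrete, here is a minimal sketch in Python, with invented data rather than any vendor's real schema, of the roll-up an analyst actually needs. Page-level detections only become tradeable once a domain-to-issuer map exists:

```python
from collections import defaultdict

# Hypothetical page-level detections: domain -> technologies fingerprinted there.
detections = {
    "azure.com": {"Salesforce", "Akamai"},
    "github.com": {"Salesforce", "Fastly"},
    "microsoft.com": {"Adobe Analytics", "Akamai"},
}

# The missing layer: which issuer owns each domain. Without this map,
# the detections above cannot be rolled up to a tradeable entity.
domain_to_issuer = {
    "azure.com": "MSFT",
    "github.com": "MSFT",
    "microsoft.com": "MSFT",
}

issuer_stack = defaultdict(set)
for domain, techs in detections.items():
    issuer = domain_to_issuer.get(domain)
    if issuer is None:
        continue  # unresolved domains stay stuck at page level
    issuer_stack[issuer] |= techs

print(dict(issuer_stack))  # "Microsoft uses X", not "azure.com uses X"
```

Everything hard about the problem lives in building and maintaining that second dictionary at the scale of the open web.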

This is the gating problem for using web data in equity research. And for most of the past decade, the established technographics market has not solved it – not because the problem is impossible, but because the customers driving that market never asked it to.

A market built for sales, not investors

The vendors that have come to define commercial technographics were built around B2B sales and marketing intelligence. Their customers are sales reps targeting accounts that run a particular CRM, marketers sizing the addressable footprint of a SaaS product, and outbound teams looking for renewal triggers. For those use cases, domain-level matching is enough. A rep targeting companies using AWS does not especially care whether azure.com is correctly mapped to Microsoft the issuer; they care about the lead in front of them.

That commercial logic has shaped what the category has built. Detection at scale, yes. Coverage of major technologies, yes. CRM integration and account-based marketing workflows, certainly. But the work of resolving a sprawl of domains and subdomains to a single issuer-level identity – the layer that matters for portfolio analysis – has been left undone, because no one in the customer base was paying for it to be done.

For institutional data teams looking at web data as an alternative signal, that has been the persistent friction. The data exists, the vendors exist, but the entity-resolution layer needed to make any of it usable at portfolio level has had to be built in-house, one analyst at a time.

Starting from the entity, not the technology

Amsterdam-based DataProvider is approaching the problem from the opposite direction. Rather than starting with a technology and asking which sites use it, the firm starts with the website itself and derives every data point it can from there – including, but not limited to, the technologies in use. The result is a structured layer that sits on top of the open web, mapping companies to their digital estates and digital estates back to companies.

“We already have what we call the company engine, which derives company information and entity resolution from a website,” says Mathijs Baas, sales engineer at DataProvider. “That sounds simple, but it’s more complicated than you’d think. It allows us to find the ultimate ownership of these websites and map that to the company. From there, matching the ticker isn’t really hard anymore. The hard part was finding all the websites belonging to that entity.”

The engineering of that mapping is what DataProvider calls clustering.

“We already had a function called Same Owner. If you take microsoft.com, you see subdomains and main domains – assets belonging to Microsoft, even where the names differ. What we do is try to find the common denominator,” says Baas. “We do that based on a mix of public factual data and proprietary algorithms, and some have more weight than others. We train the model and attach a score for significance, ranging from an A score where there’s high probability all the way down to an F score. Previously that was on an input/output basis. The new version creates a separate column, so you don’t have to look it up each time. The mapping sits in the data itself, which lets you analyse adoption patterns at scale.”

That shift – from query-based lookup to a persistent column that travels with every record – is the difference between resolving one company at a time and being able to ask portfolio-level questions across thousands at once.
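
A rough illustration of the difference, using a hypothetical schema rather than DataProvider's actual one: once the resolved cluster and its grade travel as columns on every record, a portfolio-level question becomes a single group-by instead of one lookup per company.

```python
import pandas as pd

# Hypothetical records: each website row carries its resolved cluster
# (issuer ticker) and an A-F significance grade as persistent columns.
records = pd.DataFrame([
    {"domain": "azure.com",      "cms": "Custom",  "cluster": "MSFT", "grade": "A"},
    {"domain": "github.com",     "cms": "Custom",  "cluster": "MSFT", "grade": "A"},
    {"domain": "example-shop.eu","cms": "HubSpot", "cluster": "MSFT", "grade": "E"},
    {"domain": "crm-tools.biz",  "cms": "HubSpot", "cluster": "CRMX", "grade": "B"},
])

# A portfolio-level question in one pass: which issuers run which CMS,
# counting only high-confidence mappings (grades A-C).
confident = records[records["grade"].isin(list("ABC"))]
print(confident.groupby(["cluster", "cms"]).size())
```

The grade column is doing real work here: the low-confidence E-grade row is excluded before the question is even asked.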

A classifier, with the caveats that brings

A machine-learning classifier handles change in the underlying data more gracefully than a static mapping would. When ownership shifts – through acquisition, rebrand, or a website moving infrastructure – the inputs the classifier uses change too, and the connection updates accordingly. Baas describes the result as auto-maintained, but he is upfront about the trade-off.

“The downside of a classifier is that it tries to get close to the absolute truth, but 100% accuracy is very hard to achieve with machine learning,” he says. “So we run active checks and balances to maintain accuracy.”

That kind of transparency is uncommon from vendors describing their own infrastructure, and it matters for how the data should be used. Entity resolution at scale is a probabilistic exercise, and the appropriate response is to treat the output as confidence-weighted rather than absolute. For a quantitative researcher building a backtest, that distinction is operationally important.
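
One way to operationalise that, sketched here with purely illustrative weights that are not DataProvider's calibration, is to weight each mapping by its resolution grade rather than treating resolution as binary:

```python
# Map the A-F grades to weights instead of treating resolution as yes/no.
# These thresholds are invented for illustration.
GRADE_WEIGHT = {"A": 1.0, "B": 0.9, "C": 0.7, "D": 0.4, "E": 0.2, "F": 0.0}

def adoption_score(rows):
    """Confidence-weighted count of installations attributed to an issuer."""
    return sum(GRADE_WEIGHT.get(r["grade"], 0.0) for r in rows)

msft_rows = [{"domain": "azure.com", "grade": "A"},
             {"domain": "example-shop.eu", "grade": "E"}]
print(adoption_score(msft_rows))  # 1.2 -- the shaky match barely moves the signal
```

A backtest built this way degrades gracefully when the classifier is wrong, rather than inheriting its errors at full weight.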

On top of the clustering layer sits a productised expression of the same work, called Recipes – pre-configured datasets that map technologies to relevant stock tickers, available via API for clients who want to apply the mapping to their own portfolios.
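
As an illustration only, not DataProvider's actual API or schema, applying a recipe-style extract to a custom portfolio is essentially a join:

```python
import pandas as pd

# Hypothetical recipe-style extract: technology -> resolved ticker mappings.
recipe = pd.DataFrame([
    {"technology": "HubSpot",    "ticker": "ACME"},
    {"technology": "Salesforce", "ticker": "MSFT"},
    {"technology": "Salesforce", "ticker": "WIDG"},
])

# A client's own holdings; the merge restricts the mapping to the portfolio.
portfolio = pd.DataFrame({"ticker": ["MSFT", "ACME"], "weight": [0.6, 0.4]})

exposure = recipe.merge(portfolio, on="ticker")
print(exposure)  # per-holding technology exposure, ready for screening
```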

What panels make possible

Recipes makes single-vendor analysis easier. The more interesting application is what happens when the same entity-resolution layer is run across an entire index. DataProvider is preparing to release that capability, called ‘panels’, which is in pre-release at the time of writing.

“You’d get a drop-down saying ‘panel: Russell 2000’ and you’d get all the websites of all the companies in that panel,” says Baas. “From there you add fields – CRM, for example – to see what CRM technologies they use, what’s most common across the Russell 2000. You can stretch that out into a matrix that visualises where the most concentrated technology sits, whether there’s risk attached to a vendor concentration. You can even go back and see that a specific company is using HubSpot in March 2026 but was using a different CRM in March 2025.”

The historical depth – DataProvider keeps four years of monthly snapshots – turns static technology detection into something closer to an adoption curve. Combined with sector classifications such as GICS or SIC, the same panel can be sliced to ask whether AI tooling is being adopted faster within particular industries inside the Russell 2000, or whether vendor concentration is rising in mid-cap retail compared with mid-cap industrials.
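
A toy version of that analysis, again with invented data, shows how monthly snapshots support both the concentration matrix and the per-company switching history Baas describes:

```python
import pandas as pd

# Hypothetical monthly snapshots for a panel: one row per (month, ticker),
# with the detected CRM and a sector classification.
panel = pd.DataFrame([
    {"month": "2025-03", "ticker": "AAA", "crm": "HubSpot",    "sector": "Retail"},
    {"month": "2025-03", "ticker": "BBB", "crm": "Salesforce", "sector": "Industrials"},
    {"month": "2026-03", "ticker": "AAA", "crm": "HubSpot",    "sector": "Retail"},
    {"month": "2026-03", "ticker": "BBB", "crm": "HubSpot",    "sector": "Industrials"},
])

# Vendor concentration per month: how many constituents sit on each CRM.
print(panel.pivot_table(index="month", columns="crm",
                        values="ticker", aggfunc="nunique", fill_value=0))

# Per-company history: BBB switched CRM between March 2025 and March 2026.
print(panel[panel["ticker"] == "BBB"][["month", "crm"]])
```

Adding the sector column to the pivot's index is all it takes to ask the GICS- or SIC-sliced versions of the same question.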

The decision to start with the S&P 500 and the Russell 2000 reflects where the demand has come from rather than where the technical possibilities end. Initial panels are hard-coded by DataProvider; clients running custom portfolios can retrieve the underlying data through Recipes and apply their own ticker lists in their own environments.

Underneath the technical framing sits a point about the economics of adoption itself. Web-derived technology data captures every installation equally – a one-person SMB picking up a free tier sits in the same dataset as a multi-billion-dollar enterprise rolling out a global deployment – and that, treated naïvely, is a problem.

“Not every installation of a tech solution is equal. Size matters,” says Baas. “An SMB starting to use HubSpot has less impact than companies from the Russell 2000 doing the same.”

The panels approach narrows the lens to the cohort that matters for institutional analysis without discarding the underlying data – the long tail is still there for anyone who wants it.
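
A simple way to express that adjustment, with purely illustrative cohort weights, is to score installations by the size of the adopter rather than counting each one equally:

```python
# Illustrative cohort weights: an index constituent adopting a tool carries
# more signal than a one-person SMB on a free tier. Values are invented.
COHORT_WEIGHT = {"mega": 5.0, "index": 3.0, "smb": 0.1}

installs = [
    {"domain": "bigco.com",   "cohort": "index"},
    {"domain": "tinyshop.io", "cohort": "smb"},
    {"domain": "tinyblog.io", "cohort": "smb"},
]

raw = len(installs)
weighted = sum(COHORT_WEIGHT[i["cohort"]] for i in installs)
print(raw, weighted)  # 3 raw installs, but only 3.2 in size-weighted terms
```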

What the data actually surfaces

That narrowing only matters if there is appetite for it. Baas suggests there is.

“We know firms that have already done this themselves – we just want to make it easier for them, because we have the domain knowledge. We know how our data works, what it contains, and how to train on it to build better classifiers,” he says.

What the underlying data already shows, in DataProvider’s reading, is the kind of qualitative-but-grounded signal an analyst would otherwise have to triangulate from earnings calls and channel checks. Baas cites the AI tooling space as a current example.

“We got a lot of questions about Lovable, the AI vibe-coding platform. Based on the classifiers and combinations of fields we have, we can see that most of the websites using it are very small. These aren’t enterprise-level companies,” he says. “If you don’t have enterprise-level adoption, the technology is also easier to swap out – you haven’t integrated it deeply into a company in a way that’s hard to undo.”

The same lens, applied across the wider software stack, generates a pattern that maps onto the AI displacement narrative now driving software valuations.

“We saw this with the wider software sell-off. There are things that are going to be replaced or made easier by AI. From what we can see, the platforms that are hard to get rid of – Salesforce, for example – are still growing,” Baas says. “Single-layer workflows like sending an email through Mailchimp – we see decreases there, because that’s easier to replicate with AI.”

That is the kind of signal that web-resolved technology data is uniquely placed to provide. Adoption is observable from the outside; depth of integration is implied by which categories continue to grow and which decline; the question of whether a vendor is genuinely sticky or merely incumbent becomes, at least partially, an empirical one.

Where this lands

Web-technology data has been around long enough to feel familiar. What has held it back as an investment signal is not detection coverage but the layer above it – mapping the open web’s sprawl to the entities and tickers that investment workflows are organised around. Solving that layer at scale takes a problem that was tractable for sales teams and makes it tractable for portfolio managers.

Where it is most likely to surface real edge is in the corners of the market where conventional coverage thins out. Russell 2000 names, small and mid-cap European issuers, and other under-covered cohorts are the natural ground for any signal class that depends on observable behaviour rather than analyst attention. Mega-cap technographics is a crowded picture; small-cap technographics, properly resolved to the issuer, is closer to a frontier.

The category has been quietly waiting for someone to build the layer the incumbent vendors had no commercial reason to build. That work, on the evidence of what DataProvider and others are bringing to market, is now happening.
