Infrastructure Monitoring: Mapping Technical Performance to Business Performance

By Mike O’Hara, Special Correspondent, TradingTech Insight.

High-performance infrastructure and connectivity technologies have become essential components of trading architectures in today’s electronically traded markets. And as trading systems become ever faster, increasingly interconnected, and process more and more data, there has never been a greater need for accurate and comprehensive monitoring of the infrastructure that underpins them. But infrastructure monitoring is no longer just about troubleshooting potential system and connectivity problems. Innovative firms are now gathering performance-related data to drive analytics that can also significantly boost business performance.

This topic and others will be discussed during a panel session at A-Team Group’s TradingTech Summit Virtual. The panel – Optimising trading infrastructure for high performance in fast markets – features speakers from Credit Suisse and sponsors CJC, BSO and Options IT, and will explore how firms can derive analytical value from the data they collect on the performance of the trading systems and connectivity networks.

So how can firms gain new and unique insights into their own activities and those of their clients by tapping into the rich seams of data points from their trading networks, and develop proprietary analysis that can differentiate their offerings in the marketplace?

Why infrastructure monitoring is important

Business guru Peter Drucker is often quoted as saying, ‘If you can’t measure it, you can’t improve it’. This is particularly true in the world of electronic trading, where firms across the industry rely on a complex, interconnected network of high-performance systems to conduct their business.

Being able to monitor exactly how these systems are operating, and how they are interacting with each other, is essential in today’s markets, says Naftali Cohen, Chief Revenue Officer at MayStreet, a market data technology company.

“The primary benefit of infrastructure monitoring is preventing unintended consequences and performance bottlenecks,” he says. “It is also important to consider the behaviour and performance characteristics of the applications that are being connected (whether serving or consuming applications), the interactions between applications and connectivity, and different connectivity technologies within the same architecture.”

This is not a trivial task. Comprehensive infrastructure monitoring involves not only keeping track of the performance of hardware, networking, operating systems and application components, but also looking at metrics around market data and order flow.

“Monitoring high-performance connectivity in electronic trading is critically important to properly maintain both the physical networking infrastructure and to have a full understanding of their application-level behaviour,” says Tim Field, VP Engineering at HPR (formerly Hyannis Port Research), a company that builds capital markets infrastructure. “This requires hardware timestamping to nanosecond resolution accompanied with analytics that highlight a range of potential problems, from a client overloading the message rate on a single exchange session to a custom TCP trading stack misbehaving on a trading server. Without this level of monitoring, these issues would likely go unnoticed, resulting in latency degradation and system instability.”

Where is the value?

In the past, infrastructure monitoring was seen as a necessary cost rather than something that could provide added value. But that is now changing, according to David Gaul, Head of Service Assurance at BSO, a global infrastructure and connectivity provider.

“Building mission-critical infrastructure for capital markets is more than just connectivity,” he says. “Monitoring and acting on the output is a key part of measuring effectiveness and maintaining optimal performance, but the biggest challenge for any firm is what to do with all the outputs from the various monitoring systems. It is the analysis of the data that poses the biggest challenge for teams seeking to boost performance in support of their business. We believe that in addition to automation, AI and machine learning can play a huge part in this for businesses. AI and machine learning can help teams in the processing of the data and help orientate it to the business environment to improve decision making and hence increase their overall performance and improve their trading objectives.”

“The data that’s being pulled from the systems is now leading investment decisions a lot more than it was in the past, when it was basically a race to zero, where everyone wanted to get the best possible latency to the venue,” says Roland Hamann, Chief Technology Officer at financial technology vendor Pico. “Technologically advanced trading firms are trying to identify which monitoring data will provide them with additional value in their decision making – meaning, how is that dataset useful for us?”

This is a question that more firms are asking, leading to an increasing trend of firms monitoring IT metrics in real-time, and combining data from multiple sources to provide more meaningful observation and analysis, says Steve Rodgers, Chief Technology Officer at Beeks Analytics, specialists in the monitoring and performance of trading systems.

“One thing we’re seeing is rather than having lots of silos of data, like a separate network monitoring system, separate application agent-based monitoring, separate network operations alerting, and so on, there’s a trend where people are now bringing it all together and unifying the different layers of data they’re gathering into an observability stack, and taking a more holistic approach to infrastructure monitoring,” says Rodgers. “That’s where the value starts to come in, as you gain useful insights from the relationships between the data.”

For forward-thinking firms, monitoring is no longer an afterthought; it is included in the design of their entire trading infrastructure. This can enable a style of monitoring akin to the telemetry used in Formula One, says Steve Hicks, Chief Technology Officer at flow monitoring specialist Instrumentix.

“The only way that F1 teams are able to achieve the performance that they need is by using the data from sensors on all the different components of their cars and analysing it. And that’s the same thing with trading platforms, because they’re highly distributed. So you can think of these different tiers like SORs, algos, feed handlers, execution gateways, client connectivity gateways etc., as components of such a car, and likewise you need sensors to pick up that data. In the past, it was far more challenging to deliver and effectively integrate that rich granular data, but it is now possible.”

From the ‘what’ to the ‘why’

Gaining visibility into such data can be illuminating, according to Hicks. “The reality is that problems are rarely due to the network, although the poor network guys usually have to prove it,” he says. “It’s almost always the servers in the plant – and more importantly, what’s running on them – where the issues tend to be, but many flow monitoring systems are completely incapable of integrating host data. And as a result of that, they’re lacking this big spotlight that can shine a light on where the problems are. And that’s of critical importance. The point is, it’s not a good idea to just light up your network, but have these dark servers, because that’s actually oftentimes where things are going wrong.”

Being able to observe everything in real time can pre-empt problems before they occur, says Rodgers. “It can help with business functions if you can leverage the real-time observability, to be able to provide you with enough insight to increase your profitability. That’s the goal. But certainly, how people are using this at the minute is to provide a proactive means of resolving issues before they happen. If you’ve got potential problems with a liquidity provider, or a broker, or someone else you cross-connect to in the trading datacentre for example, you can trap them in real time, and have the processes in place to react without having to spend days forensically analysing things.”

Andrew Ralich, CEO of OneZero, a company that develops enterprise trading technology, points out that the statistics generated from infrastructure monitoring are very much the ‘canary in the coal mine’. “They give you the ‘what’ but they don’t give you the ‘why’,” he says. “That’s an important, distinctive point. You can look at the load of network hardware, you can look at the CPU on your host infrastructure, you can look at the data IO, but that tells you nothing about whether the market is surging in its activity and pace and quoting. Is there some client out there who has an algo gone wild, and he’s actually just peppering you with orders, for example? So it doesn’t give you the ‘why’ necessarily. It’s the leading indicator – both in real time and retroactive – that you need to dig further. But it’s not the sole answer to understanding what’s going on. And without this data, you miss the ability to proactively understand what’s going on in the system. But you need to do other things in order to really drive value out of it. And infrastructure monitoring alone isn’t going to allow you to tune and understand and build out your infrastructure.

“Speed of execution has always been a critical factor while the pace of the markets keeps increasing,” continues Ralich. “We’re fighting a multi-dimensional battle. And we’re doing that with a multi-dimensional data set that gets all the way down into basically recording and persisting indefinitely, the raw data coming into the system and being able to use that post facto to figure out the ‘why’.”

Three pillars of observability

To provide a solid foundation for full-stack observability throughout a trading infrastructure, firms can draw upon three pillars of observability: logs, tracing, and telemetry. Log data can come from applications, network components, operating system components, middleware, trading software, or a wide range of other sources.

“There’s a balance to be had in instrumenting and logging everything,” says Beeks’ Rodgers. “You can be overwhelmed by the volume of the data that generates. So you need to find that happy balance between having the capability to observe the points that are important and being able to go down to the necessary level of detail.”

Tracing involves following a specific event as it flows through the infrastructure. One example might be a market data tick as it enters the premises, goes through a switch, into a feed handler and into a trading engine. Another example might be an order, where there are potentially many more points of measurement as the order lifecycle propagates, from order generation, through submitting to an exchange, going through the exchange gateway into the matching engine and coming back as a fill.

“Tracing also is about cause and effect,” says Rodgers. “So it’s not just those individual points in time for a particular transaction, it might be stitching a number of them together. So a tick could generate multiple orders, which go through multiple smart order routers, and so on and so forth, or multiple ticks might cause multiple events. So it’s having the ability to stitch those traces together. We call it connecting data in motion.”
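As a rough illustration of tracing, the sketch below follows one market data tick through a series of capture points and reports the time spent on each leg. It assumes every capture point can tag its observation with a shared identifier; the `Observation` type, the `trace_id` field, the hop names and the timestamps are all hypothetical and do not come from any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    trace_id: str  # hypothetical ID shared by every sighting of the same event
    hop: str       # name of the capture point, e.g. "switch"
    ts_ns: int     # capture timestamp in nanoseconds


def hop_latencies(observations, trace_id):
    """Return [(from_hop, to_hop, delta_ns)] for one trace, ordered by time."""
    points = sorted(
        (o for o in observations if o.trace_id == trace_id),
        key=lambda o: o.ts_ns,
    )
    # Pair each capture point with the next one to get per-leg latency.
    return [(a.hop, b.hop, b.ts_ns - a.ts_ns) for a, b in zip(points, points[1:])]


# Observations may arrive out of order from different collectors.
obs = [
    Observation("tick-1", "feed_handler", 1_000_004_200),
    Observation("tick-1", "switch", 1_000_000_900),
    Observation("tick-1", "ingress", 1_000_000_000),
    Observation("tick-1", "trading_engine", 1_000_011_700),
]

legs = hop_latencies(obs, "tick-1")
for src, dst, delta in legs:
    print(f"{src} -> {dst}: {delta} ns")
```

Stitching cause and effect together, such as one tick triggering several orders, would need an additional link between trace identifiers, which this sketch leaves out.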

The third pillar, telemetry, is possibly the most complex.

“Telemetry is providing real time and historical metrics about the underlying quality of both the infrastructure and the applications, and the sessions of data within those,” says Rodgers. “So for example, if you’ve got market data, looking at the message rates, burstiness, gaps, the quality of the TCP sessions to the venues, for example. And rather than looking at that purely at the infrastructure layer, you might start to track analytics at the instrument or symbol layer as well. There’s a lot of stuff that can be built around that, once you’ve got that observability, in that you can identify things that could be done better, as well as where there might be problems.”
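To make those metrics concrete, here is a minimal sketch that derives sequence gaps, per-bucket message counts and a simple burstiness figure from a captured feed. The input format (sequence number plus nanosecond timestamp) and the peak-to-mean definition of burstiness are assumptions chosen for illustration; production systems may define and compute these quite differently.

```python
from collections import Counter


def sequence_gaps(seqs):
    """Return (first_missing, last_missing) ranges between consecutive arrivals."""
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur > prev + 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


def bucket_counts(timestamps_ns, bucket_ns):
    """Count messages per fixed-width time bucket."""
    return Counter(t // bucket_ns for t in timestamps_ns)


def burstiness(timestamps_ns, bucket_ns):
    """Peak bucket count divided by mean count over the observed span."""
    counts = bucket_counts(timestamps_ns, bucket_ns)
    n_buckets = max(counts) - min(counts) + 1  # includes empty buckets
    mean = len(timestamps_ns) / n_buckets
    return max(counts.values()) / mean


# Hypothetical capture: (sequence number, timestamp in ns).
messages = [
    (1, 0), (2, 100_000), (3, 200_000),
    (4, 1_500_000),
    (7, 3_200_000), (8, 3_900_000),
]
seqs = [s for s, _ in messages]
stamps = [t for _, t in messages]

gaps = sequence_gaps(seqs)
counts = bucket_counts(stamps, 1_000_000)  # 1 ms buckets
ratio = burstiness(stamps, 1_000_000)
print(gaps, dict(counts), ratio)
```

The same functions could be run per instrument or symbol by grouping the messages first, which is the kind of symbol-level tracking Rodgers mentions.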

Bringing it all together

Much of this data can be generated from things like taps, agg switches and net probes, although vendors such as Instrumentix also offer lightweight, software-based traffic probes utilising kernel bypass, which are lower in footprint.

“Software-based latency analytics can correlate various message types, measuring not just the performance of one application but how that application works in concert with other applications on the same network such as exchange matching engines, risk gateways and network infrastructure,” says HPR’s Tim Field. “Today’s market participants need to look beyond overall message rates, however. They need to understand how application-level network protocols are packed into lower-level protocols — TCP, IP, Ethernet and so on — to identify when packet-level issues occur and how it impacts their latency.”

Understanding latency is critical, particularly at the high frequency end of the market.

“The more latency sensitive you are, as a trading firm, the more granular you need to get in terms of these taps, as you come into and out of a switch, into and out of a network card, into and out of a cross-connect,” says Bill Fenick, VP Enterprise at Digital Realty, a global data centre company. “So, all those ingress and egress points have to be measured if ultra-low latency is something that you fundamentally rely on. And given the fact that time is one of the variables to factor into your algorithms, if you’re able to get very accurate monitoring of specific steps of the trading logistics chain, you can feed that into your algorithms to enable you to know what happens in certain circumstances.”

An important aspect of this is being able to map and correlate what’s happening on the network against what’s happening from the business perspective, for example by mapping network packet data against order data. This can be challenging, but is possible with the right tools and the right approach.
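One simple way to picture this mapping is a join on a shared order identifier. The sketch below assumes that decoded wire captures expose the same client order ID that the application logs; the field names (`cl_ord_id`, `kind`, `wire_ts_ns`, `app_ts_ns`) and the sample values are hypothetical. Real deployments also have to deal with clock synchronisation and protocol decoding, which are out of scope here.

```python
def correlate(orders, packets):
    """Attach wire-level timings to application-level order records."""
    wire_by_id = {}
    for p in packets:
        wire_by_id.setdefault(p["cl_ord_id"], {})[p["kind"]] = p["wire_ts_ns"]

    rows = []
    for o in orders:
        wire = wire_by_id.get(o["cl_ord_id"], {})
        sent = wire.get("new_order")
        ack = wire.get("ack")
        rows.append({
            "cl_ord_id": o["cl_ord_id"],
            "symbol": o["symbol"],
            "qty": o["qty"],
            # Time from the application's own log stamp to the wire.
            "app_to_wire_ns": None if sent is None else sent - o["app_ts_ns"],
            # Round trip as seen on the wire; None if no ack was captured.
            "wire_rtt_ns": None if sent is None or ack is None else ack - sent,
        })
    return rows


orders = [
    {"cl_ord_id": "A1", "symbol": "XYZ", "qty": 100, "app_ts_ns": 1_000},
    {"cl_ord_id": "A2", "symbol": "XYZ", "qty": 250, "app_ts_ns": 2_000},
]
packets = [
    {"cl_ord_id": "A1", "kind": "new_order", "wire_ts_ns": 1_450},
    {"cl_ord_id": "A1", "kind": "ack", "wire_ts_ns": 51_450},
    {"cl_ord_id": "A2", "kind": "new_order", "wire_ts_ns": 2_600},
]

rows = correlate(orders, packets)
for row in rows:
    print(row)
```

Once the two views share a row, business attributes such as symbol and size can be analysed against network timings in whatever database or analytics tool a firm prefers.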

“Data from a flow monitoring platform can – where appropriate – be offloaded to a big data platform, if it’s available,” says Hicks. “And the reason is, that it’s a related area of analysis that is at times best served by the tooling available in those platforms, such as Apache Spark, or KX to analyse time series data, and so on.”

Combining data sources in this way can serve as a launchpad for much wider analysis, says Donal O’Sullivan, MD and Global Head of Product Management at Pico. “It’s about understanding these massive amounts of data,” he says. “So all of the information leading up to an order being generated, all of the market data, everything to do with that order, including any child orders, the business outcomes. So essentially things like fill rates, and transaction cost analysis, and all the performance-related data, and putting that into a massive multi-dimensional database. And then, once you’ve done that, that’s when the real fun begins, because now you have the IT teams, you have the risk and compliance guys, you’ve got the software devs, and they all want their own ability to query and look for patterns in that data. But they’re all asking slightly different questions. So what we’ve seen is the emergence of much more flexible, high-performance database tools, rather than just network capture, or network monitoring. Teams are looking for the ability to run machine learning algorithms over this data to do all sorts of stuff. And that’s where you move from real-time problem solving to much deeper value extraction.”

Monitoring of low-level infrastructure data is a key part of this, says O’Sullivan. “The quality of the data that the network guys are working from is much better, because it’s real time,” he says. “It’s extremely granular, and it’s typically nanosecond accurate. In the past, it was all of those things, but it was not in a form that other teams could use, so they were using software logs on a T+1 basis. And they were also completely missing the IT performance dimension in their analysis. Now, what they’ve realised is that they can use this data from the network to create a much higher quality data record, so they’re extracting the data from these IT and network tools, and then putting it into the types of databases that they like to use and that are suited for this type of flexible, multi-dimensional interrogation.”

This approach can deliver meaningful performance-related analytics, says MayStreet’s Cohen: “Performance-related analytics combine infrastructure-level monitoring and performance data (e.g., bandwidth utilisation, one-way and round-trip delays, packet drop and retransmission rates, and microburst detection), with application- and business-level data (e.g., order sizes and direction, news or social media activity, derivative or underlying market activity). Innovative technologies, such as pattern recognition and machine learning, are being used to predict application and business signals from infrastructure performance and vice-versa. Conversely, business insight can be used to optimise infrastructure performance, e.g., by elastic provisioning of connectivity, bandwidth or compute capacity ‘just in time’ for predictable demand.”

Visual analysis via dashboards can be used to help firms understand and provision for future capacity, says Steve Moreton, Global Head of Product Management at CJC, a market data consultancy and service provider. “Giving someone a visual dashboard to understand capacity is useful, however it’s still a human-led process,” he says. “When a client has hundreds of servers, machine-led processes save time and add a reliability layer. CJC has embedded PCA (Principal Component Analysis), which provides a great visual dashboard to quickly show which servers have been over or underutilised over the last week. We have built this to pro-actively message the monitoring system if PCA thinks a server is particularly over-utilised.

“Firms are looking to visualise more and more of their servers’ and applications’ behaviour and there is a continual process to have a holistic view of every system,” continues Moreton. “Some important metrics are held in entitlement systems such as DACS or EMRS, as well as inventory management systems. Some of this information is very useful in correlating what the user is doing to how that user affects the servers which provide the data. Knowing a particular business unit has started to use more OTC data, for example, can trigger some pro-active capacity management. The ultimate goal of capacity management is to ensure that the system runs effectively and stops any crashes before they happen. Nothing can impact execution rates worse than an outage.”

Real world examples

Clearly, infrastructure monitoring can help firms ensure their trading platforms are running smoothly, efficiently, and at optimal performance levels. But how are firms actually using their infrastructure data to enhance business outcomes in the real world?

“One of the big business improvements we have seen is the time and material savings,” says Moreton. “Things that could take engineers or analysts half a day or a full day’s worth of work, sifting through database exports, are now done in seconds. Having better power over your data means that the engineer/analyst has more time to work on business-critical projects. It also means that these duties could be migrated to different teams, as what was once complicated is now easy.”

“Firms can take advantage of their internal data and analyse it against public market data to better understand the quality of their executions versus the market as a whole,” says MayStreet’s Cohen. “We see firms using this analysis to demonstrate execution quality to their clients. Additionally, this analysis also provides insights into micro-market structure that can guide algo development. Quantitative analysts can tailor their algos based on the characteristics of internal flow and how that flow interacts with the marketplace.”

OneZero’s Ralich offers a good example of this: “If you’re a bank, with a whole bunch of clients, who you serve with liquidity, you might have different liquidity profiles for different clients, and you’ll structure your quotes accordingly, depending on the type of client. You can analyse how they’re interacting with that liquidity from an infrastructure perspective, in terms of speed of hitting quotes, and so on, to help you better create the liquidity book for that client. That takes, first and foremost, the recording of the data. Second, a real investment in data science and analytics to learn how to properly understand it. And then third, a breadth of client base and global range to be able to understand the different profiles of streams and performance-impacting inputs to the system that can exist.

“A good example is looking at a client’s liquidity profiles, their incoming streams coming from LPs together with their outgoing pricing, and being able to have a global perspective,” continues Ralich. “For example, somebody might be drinking from a firehose, doing a tremendous amount of complex pricing transformations, and then outputting a feed with very specific spread ranges for their clients, which ticks maybe 100 times a second. But they’re building that off a feed that ticks 40,000 times per second. Knowing that, understanding and having the data to look at how they’ve built their pricing function, even working with their LPs, allows them to right-size those streams to make sure they’re getting the most out of the hardware in the system.”
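To put numbers on the mismatch Ralich describes, the sketch below conflates a synthetic 40,000-ticks-per-second feed down to one update per 10 milliseconds, i.e. 100 per second. Conflation is used here only as one illustrative way to right-size a stream; it is not presented as OneZero's method, and the synthetic prices and the `conflate` helper are hypothetical.

```python
def conflate(ticks, interval_ns):
    """Keep only the last tick seen in each interval.

    `ticks` is an iterable of (ts_ns, price) pairs in time order.
    """
    latest = {}
    for ts_ns, price in ticks:
        latest[ts_ns // interval_ns] = (ts_ns, price)  # later ticks overwrite earlier ones
    return [latest[k] for k in sorted(latest)]


# Synthetic input: 40,000 evenly spaced ticks across one second,
# with integer prices to keep the arithmetic exact.
TICK_SPACING_NS = 25_000
ticks = [(i * TICK_SPACING_NS, 10_000 + i % 7) for i in range(40_000)]

out = conflate(ticks, interval_ns=10_000_000)  # 10 ms -> 100 updates per second
print(len(ticks), "ticks in,", len(out), "updates out")
print("reduction factor:", len(ticks) // len(out))
```

Each outgoing update here stands in for 400 incoming ticks, which is the sort of ratio that monitoring data can surface when deciding how much upstream bandwidth and compute a pricing function actually needs.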

Beeks’ Rodgers notes that extending observability out from one’s own firm to include other market participants can offer clear benefits. “You might find that certain liquidity providers show certain patterns around traffic throttling, for example, or bandwidth throttling, so you want to be able to see when certain patterns kick in, and take appropriate action. And this is where the telemetry comes in, by being able to provide this real time monitoring and feeding it into the trading application. So not just having an infrastructure dashboard, but providing these stats into your trading application to be able to make proactive decisions on. Measurements of your own or your counterparties’ reaction times aren’t meaningful unless you’re combining application data with real data from the wire, and increasingly clients are looking to include these real-world times in their tick and trade repositories in addition to their application timings.”

Liquidity providers can use this kind of data proactively, says Digital Realty’s Fenick. “If you’re running your infrastructure in a collocated environment, you can provide the evidence to show that if a client’s application was sitting next to yours, you could provide them with more timely and competitive quotes and prices, for example. And you can use that as a selling point to say, ‘through your desktop, we see you hitting on prices and you currently have an average latency of 500 microseconds, whereas if you cross-connect your application right next to our application, you’d have a deterministic latency profile of 50 microseconds, and you would get a more attractive price’. So for the bulge bracket banks and liquidity providers (LPs), it’s a more proactive way to go to market and attract more flow.”

Instrumentix’s Hicks cites a good example of telemetry in action. “One client was struggling with disconnections of a feed handler at times of increased market activity, and they had extremely limited visibility of what was going on. So by tracking TCP behaviour, traffic volumes, and connection reset events in real time, and putting that alongside flow monitoring, we were able to show that the feed handler was unable to keep up with the amount of traffic that it was consuming. The TCP/IP stack was winding the window right down to the point where the feed handler was getting disconnected. After showing that to the client, they were able to parallelise the consumption in the feed handler more effectively by improving the threading and providing faster compute. The situation was resolved, and it ended with them improving the speed and reliability of their price consumption.”

Not falling foul of regulatory requirements is another area where innovative use of infrastructure monitoring can benefit firms, says O’Sullivan at Pico. “One example of something that we built is around order to trade ratio (OTR) restrictions on trading participants, which looks at your current OTR and predicts what your ratio is going to be by the end of the day. Firms don’t want to be told by an exchange that they’ve violated that ratio. So we built a predictive algorithm that looks at the real-time data from the start of the trading day. And within about half an hour of trading, it can tell you that if you keep going like this, you’re going to get slapped with a fine by this evening. And that’s obviously something that’s possible for everyone to calculate when the day is over, but by then it’s too late.”
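The idea of projecting an end-of-day ratio from early trading can be sketched with a deliberately naive extrapolation. The code below is not Pico's algorithm, which the article does not describe; it simply assumes that the most recent per-minute order and trade rates persist for the rest of the session. The session length, the ratio limit and the sample counts are all hypothetical, and actual exchange OTR definitions vary.

```python
def project_otr(per_minute, session_minutes, window=10):
    """Project the end-of-day order-to-trade ratio.

    `per_minute` holds (orders, trades) counts for each elapsed minute.
    The remainder of the session is extrapolated from the last `window` minutes.
    """
    orders = sum(o for o, _ in per_minute)
    trades = sum(t for _, t in per_minute)
    recent = per_minute[-window:]
    remaining = session_minutes - len(per_minute)

    order_rate = sum(o for o, _ in recent) / len(recent)
    trade_rate = sum(t for _, t in recent) / len(recent)

    projected_orders = orders + order_rate * remaining
    projected_trades = trades + trade_rate * remaining
    if projected_trades == 0:
        return float("inf")
    return projected_orders / projected_trades


# First half hour: quoting activity picks up sharply in the last 10 minutes.
first_half_hour = [(100, 10)] * 20 + [(300, 5)] * 10

current = sum(o for o, _ in first_half_hour) / sum(t for _, t in first_half_hour)
projected = project_otr(first_half_hour, session_minutes=480)

OTR_LIMIT = 50  # hypothetical exchange limit
print(f"current OTR {current:.1f}, projected {projected:.1f}")
if projected > OTR_LIMIT:
    print("warning: projected to breach the limit by the close")
```

In this example the ratio so far looks comfortable, but the recent trend points to a breach, which is the kind of early warning that cannot be had by waiting for the end-of-day figure.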

Conclusion

In conclusion, it’s clear that the need for accurate and comprehensive monitoring of trading infrastructures is only going to get more pronounced as electronic markets evolve. Firms across the buy side and the sell side, as well as trading venues and service providers, are increasingly realising this, and leading firms are now using infrastructure monitoring not just to diagnose problems, but to differentiate their services in the marketplace and to stay competitive.

Pico’s Hamann sums things up nicely. “Anyone who hasn’t started collecting monitoring data and analysing it will lose out in the long run. If you’re a pension fund for example, and you’re sending your orders to three brokers and getting them executed, wouldn’t you want to know if any of the changes (code or infrastructure) that they’re doing on a regular basis has an influence on your order execution quality per broker or overall? Or if you’re a broker and can identify to a client a fat finger error or algorithm gone rogue instantly before these orders get executed or cause large financial damage. That is invaluable. The client will always be thankful. Many other industries are collecting monitoring data nowadays too to improve processes, predict quality issues or analyse customer behaviour. Every type of market participant should, to some extent, be capturing their own environmental and monitoring data and analysing it. And if they’re not, they should start to invest in it because it will pay off very quickly.”
