As efforts to reduce data transmission latency begin to hit the speed of light barrier (the wireless networks being rolled out are pretty much the end game), attention is turning away from the speed of trade execution and towards the speed of trade construction and optimisation – decision latency, as some call it. Addressing decision latency has moved the performance focus inside the data centre, to the question of how to squeeze the maximum from servers, typically configured as tightly coupled clusters, running application code designed to take advantage of parallel processing.
To reduce decision latency, servers need to perform complex calculations and analytics – such as Monte Carlo simulations and vector arithmetic – as well as processes such as sorting and manipulating large datasets. As a result, both compute processing and data I/O need to be optimised.
Several technology approaches are currently gaining traction to improve compute and data management performance, including overclocking CPUs, tightly coupling CPU and storage, interconnecting servers with PCI Express, deploying solid state Flash storage on the main RAM bus, implementing in-memory data stores, and leveraging parallel processing capabilities within the server or cluster.
Parallelism focuses on processing logic that can be broken down into component parts, and on executing those components alongside one another rather than one after another (that’s called serial processing). Not all logic lends itself to the parallel approach (simple processing of financial market data feeds is one example that does not), but much of the compute and data manipulation related to pre-trade analytics, trade construction and risk management is well suited.
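To make the serial versus parallel distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything taken from a real trading system: discounted_payoff is a hypothetical stand-in for an expensive per-instrument calculation, and the positions are made-up numbers. The point is only the shape of the work: each valuation is independent of the others, so the list can be handed to a pool of workers instead of being looped over.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def discounted_payoff(spot, strike, rate, years):
    """Hypothetical stand-in for an expensive, independent per-instrument calculation."""
    return max(spot - strike, 0.0) * math.exp(-rate * years)


def value_serial(positions):
    """Serial processing: value one position after another."""
    return [discounted_payoff(*p) for p in positions]


def value_parallel(positions, workers=4):
    """Parallel processing: hand the independent valuations to a pool of workers.

    pool.map returns results in input order, so the output matches value_serial.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: discounted_payoff(*p), positions))


# Made-up positions: (spot, strike, rate, years to expiry).
positions = [
    (105.0, 100.0, 0.02, 1.0),
    (95.0, 100.0, 0.02, 0.5),
    (120.0, 110.0, 0.01, 2.0),
]
```

One caveat: in standard CPython, threads will not speed up compute-bound work like this, so the sketch shows the decomposition rather than a performance gain. Real gains come from processes, native code, or the multi-core and GPU hardware discussed here.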
When it comes to implementing parallelism at the hardware level, a couple of technology approaches have been most commonly adopted: multi-core CPUs from Intel or AMD, or Graphics Processing Units from Nvidia.
In recent times, Intel has invested heavily in multi-core CPUs, with its current “Ivy Bridge” generation of Xeon chips having as many as 15 x86 cores – with each core capable of executing two logical threads simultaneously. Intel has been working closely with developers of financial markets applications – including Kx Systems, Redline Trading Solutions and Tibco Software – to ensure that their offerings make the best use of multi-core architectures.
Meanwhile, GPUs – or more correctly GPGPUs (for General Purpose Graphics Processing Units) – are massively parallel by design and have been adopted for a number of applications, notably by the likes of BNP Paribas for derivatives pricing and JPMorgan Chase for risk management, and by analytics provider Hanweck Associates to calculate Greeks for options data feeds from a number of exchanges.
Tests conducted last year by Xcelerit, which has developed a toolkit to make it easier to write parallel financial applications, compared a single-core Intel Sandy Bridge CPU with an eight-core one and with an Nvidia Tesla GPU co-processor card (with 2,688 cores) for Monte Carlo simulations. Compared with the single-core CPU, the multi-core configuration was 19 times as fast, and the GPU was 96 times as fast.
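For readers who have not written one, the following Python sketch shows why a Monte Carlo simulation splits so naturally across cores. It is not Xcelerit's benchmark code. It is a generic European call option pricer under textbook assumptions (geometric Brownian motion, constant rate and volatility), and the function names and parameters are made up for illustration. Each chunk of paths is simulated independently with its own random number generator, and the only coordination needed is adding up the partial sums at the end.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(seed, n_paths, spot, strike, rate, vol, years):
    """Return the sum of discounted call payoffs over n_paths independent paths."""
    rng = random.Random(seed)  # each chunk owns its generator, so chunks share no state
    drift = (rate - 0.5 * vol * vol) * years
    diffusion = vol * math.sqrt(years)
    discount = math.exp(-rate * years)
    total = 0.0
    for _ in range(n_paths):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total += discount * max(terminal - strike, 0.0)
    return total


def monte_carlo_price(n_chunks, paths_per_chunk, workers=4, **params):
    """Split the simulation into independent chunks, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(
            lambda seed: simulate_chunk(seed, paths_per_chunk, **params),
            range(n_chunks),  # simplistic seeding: one small integer per chunk
        )
        total = sum(partial_sums)
    return total / (n_chunks * paths_per_chunk)
```

A call such as monte_carlo_price(8, 5000, spot=100.0, strike=100.0, rate=0.02, vol=0.2, years=1.0) runs 40,000 paths in eight chunks. The thread pool here illustrates the structure only; standard CPython threads will not accelerate the arithmetic, and production code would use a random number generator designed for parallel streams rather than consecutive integer seeds.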
Importantly, Xcelerit’s test team noted that the boost gained from parallelism is highly dependent on the type of application being run. The gains here are significant because Monte Carlo calculations lend themselves to parallel processing, are heavily compute bound and require little memory access. As is the case when optimising most complex software code for a hardware platform, the devil is in the detail, and “your mileage may vary.”
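One standard way to reason about why mileage varies is Amdahl's law, which caps the speedup by the fraction of a program that must still run serially. This is a textbook model brought in here for illustration, not something Xcelerit's team is reported to have used, and it ignores memory bandwidth and data transfer costs, which often matter just as much in practice. A minimal Python sketch, with parameter names of my own choosing:

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of the run time that can be parallelised (0.0 to 1.0)
    n_units: number of processing units the parallel part is spread across
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)
```

Under this model, a workload that is 99% parallel gets roughly a 7.5 times speedup on eight units (amdahl_speedup(0.99, 8)), while one that is only 50% parallel gets less than double (amdahl_speedup(0.5, 8)), no matter how good the hardware is.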
But multi-core CPUs and GPUs are just the beginning of the world of parallelism. Intel is now aiming its many-core architecture – implemented in its Xeon Phi co-processor card and already deployed in the world of scientific supercomputing – at financial markets applications.
Indeed, I first heard of Xeon Phi when it was deployed in significant numbers in Stampede, the latest supercomputer hosted by the University of Texas at Austin. The peak performance of Stampede, which was fully let loose in the fall of 2013 with nearly 500,000 cores, is 8,520 teraflops, making it the seventh fastest supercomputer worldwide, according to the widely recognised TOP500 supercomputer list. Xeon Phi also underpins the Milky Way 2 (Tianhe-2) machine at China’s National Supercomputer Centre. With more than three million cores peaking at 54,902 teraflops, it is currently the world’s fastest supercomputer.
The current version of Xeon Phi – codenamed Knights Corner – features 60 x86 cores, each of which can run four threads simultaneously. When Xcelerit ran its Monte Carlo test on this configuration, it performed 43 times faster than the single-core implementation. But there’s more to come when the Knights Landing version is released next year; it is slated to have 72 cores, floating point and vector processing, and 384GB of on-board memory. It doesn’t take too much to figure out Intel’s trajectory.
Unsurprisingly, Intel makes much of the compatibility of Xeon Phi with its mainstream Xeon processors, and the ease of programmability and portability of applications that comes with it. It all adds up, it says, to reduced development risk, cost and time of implementation.
Intel is also happy that Xcelerit has just released a version of its toolkit that supports Xeon Phi. “The Xeon Phi coprocessor can really deliver spectacular performance to clever programmers who make good use of its cores, caches and vector processing units,” says Intel principal engineer Robert Geva, adding: “This Xcelerit SDK is very welcome as it opens up Phi performance to programmers who don’t have that expertise.”
Meanwhile, advocates of GPU technology point to frameworks such as the Nvidia-backed CUDA, which provides parallel processing extensions to languages such as C, C++ and Fortran (as well as support for tools popular in finance, such as Matlab and Mathematica) that ease parallel application development.
Just as Intel has for some years squared off against FPGA co-processors for such tasks as low-latency data feed handling and order-book building, it now looks set to do battle with the GPU crowd when it comes to reducing decision latency for intelligent trading approaches. Along the way, there might also be skirmishes with other processor architectures, such as IBM’s POWER8, which supports 96 threads per chip – a battleground made more likely given Big Blue’s shift in focus away from x86-based systems.
Pete Harris is Principal of Lighthouse Partners, an Austin, TX-based consulting company that helps innovative technology companies with their marketing endeavors. www.lighthouse-partners.com.