Earlier this summer, AMD achieved a world-record STAC-T0 benchmark result1 with its FPGA designed for ultra-low latency trading, the AMD Alveo UL3524 FinTech accelerator card, and has now just launched a new card with equivalent performance but half the size, geared towards even broader adoption among high-frequency trading firms.
STAC benchmarks are the accepted industry standard for evaluating the performance of electronic trading products. The STAC-T0 benchmark measures a trading platform’s tick-to-trade network-I/O latency, with its most important statistic being ‘Actionable Latency,’ which is defined as the time interval between the last bit of inbound data needed to make a trading decision and the first bit of the simulated outbound order.
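As a rough illustration of the definition above, the following Python sketch computes actionable latency from a pair of timestamps. It is not the STAC-T0 methodology or tooling: the class name, field names and sample timestamps are all hypothetical, and times are held as integer picoseconds purely to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TickToTradeSample:
    """One inbound-tick / outbound-order pair (hypothetical field names)."""
    last_inbound_bit_ps: int    # arrival of the last inbound bit needed for the decision
    first_outbound_bit_ps: int  # departure of the first bit of the outbound order


def actionable_latency_ps(sample: TickToTradeSample) -> int:
    """Interval from the last needed inbound bit to the first outbound bit."""
    return sample.first_outbound_bit_ps - sample.last_inbound_bit_ps


# Invented timestamps for illustration only; not measured data.
samples = [
    TickToTradeSample(1_000_000, 1_013_900),
    TickToTradeSample(2_000_000, 2_015_200),
    TickToTradeSample(3_000_000, 3_014_400),
]

latencies_ps = [actionable_latency_ps(s) for s in samples]
min_latency_ns = min(latencies_ps) / 1000  # report the minimum in nanoseconds
```

A benchmark report would typically summarise many such samples with statistics such as the minimum, median and maximum.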
In partnership with trading and execution systems specialist Exegy, AMD achieved the fastest recorded STAC-T0 benchmark result to date, by demonstrating a minimum of 13.9ns actionable latency, reducing tick-to-trade latency by up to 49% compared to the previous record of 24.2ns (which was also achieved with AMD Alveo accelerators).1
These results are significant. Today, FPGAs are frequently used for ultra-low latency trade execution. And while a few top HFT firms have heavily invested in specialised ASICs (application-specific integrated circuits) to achieve deterministic latency at nanosecond speed, the landscape is evolving. HFT firms are adopting ultra-low latency FPGA accelerators for ASIC-like performance.
FPGAs in electronic trading – a brief history
FPGAs (Field-Programmable Gate Arrays) were initially adopted by trading firms in the early 2000s, due to their capacity for parallel processing, which helped early adopters manage the increasing volumes of data in electronic trading environments. With hardware architecture able to deliver much lower – and more deterministic – latency than software-based solutions, and with reconfigurability making them ideal for implementing custom algorithms, the mid-2000s witnessed FPGAs being deployed by a proliferation of HFT firms looking to gain competitive advantage through faster execution times.
By the late 2000s, specialist vendors began to offer FPGA-based ticker plants to process market data feeds and to launch solutions that could run in-line pre-trade risk checks on FPGAs, to speed up order flow. The mid-2010s saw chipmakers Xilinx (acquired by AMD in 2022) and other industry players tailoring FPGA products specifically for financial trading applications.
More recently, FPGA innovations have included improved development tools, more accessible programming languages, and libraries and frameworks that abstract some of the complexities of FPGA development. The AMD Alveo ultra-low latency card, for example, comes with comprehensive software support via the AMD Vivado Design Suite, which offers a variety of reference designs and benchmarks to help customers develop new applications for the platform.
Today, FPGAs continue to evolve, with an increasing focus on AI and machine learning applications. The AMD Alveo FinTech accelerators underline this trend by supporting the FINN development framework, an open-source tool for deep neural network inference on FPGAs.
Where FPGAs add value to the trading stack
Within high-speed electronic trading, there are numerous areas where the use of hardware acceleration can improve performance. Any type of arbitrage trading that capitalises on short-lived price discrepancies between different markets or instruments, for example, requires the deterministic speed that can typically only be achieved through hardware. Electronic market-making and liquidity provision, particularly in asset classes such as equities, exchange-traded derivatives (ETDs), FX, and increasingly crypto, are also areas where hardware acceleration can provide a significant advantage.
Market data direct feed handling functions, which are typically exchange-specific, are much more efficiently performed by programmable hardware than software. Even if the trading logic itself resides in software, the FPGA can perform inline processing and filtering of the data stream, preparing the data in real-time at wire speed before passing relevant fields to the CPU or another suitably configured FPGA for trading functions.
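The filtering step described above can be modelled in software as follows. This is only a sketch: a real feed handler on an FPGA would be written in a hardware description language against the exchange's actual protocol, and the 16-byte message layout, the `TICK` format and the `filter_ticks` function here are invented for illustration.

```python
import struct
from typing import Iterator, Set, Tuple

# Hypothetical fixed-width message: 8-byte space-padded ASCII symbol,
# 32-bit price, 32-bit quantity, all big-endian. Not a real exchange format.
TICK = struct.Struct(">8sII")


def filter_ticks(stream: bytes, watched: Set[str]) -> Iterator[Tuple[str, int, int]]:
    """Decode each message and pass on only the fields for watched symbols."""
    for offset in range(0, len(stream) - TICK.size + 1, TICK.size):
        raw_symbol, price, qty = TICK.unpack_from(stream, offset)
        symbol = raw_symbol.rstrip(b" ").decode("ascii")
        if symbol in watched:
            yield symbol, price, qty


# Three messages on the wire; only "ABC" is of interest to the trading logic.
stream = (
    TICK.pack(b"ABC     ", 10050, 200)
    + TICK.pack(b"XYZ     ", 990, 10)
    + TICK.pack(b"ABC     ", 10051, 50)
)
relevant = list(filter_ticks(stream, {"ABC"}))
```

The point of the design is that the downstream consumer, whether a CPU or another FPGA, sees only the decoded fields it needs rather than the full raw feed.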
FPGAs can also perform line arbitration, reconstructing a reliable feed from two unreliable feeds. This is particularly useful when receiving feeds via microwave with fibre backup, for example. Because the time offsets between the two feeds are very small, the process is complex, and it is one that hardware handles exceptionally well.
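One simplified way to picture line arbitration is sequence-number de-duplication across two redundant lines. The sketch below is a software model under assumed conditions: every packet carries a sequence number, both lines carry identical payloads, and out-of-order packets are held until the gap is filled. The `arbitrate` function and the packet format are hypothetical, and production arbiters may apply different policies (for example, timeouts on unfilled gaps).

```python
from typing import Dict, Iterable, Iterator, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload); hypothetical format


def arbitrate(arrivals: Iterable[Packet], first_seq: int = 1) -> Iterator[Packet]:
    """Merge packets from redundant lines, given in arrival order.

    Forwards the first copy of each sequence number, drops later duplicates,
    and holds out-of-order packets until the missing ones arrive.
    """
    next_seq = first_seq
    pending: Dict[int, bytes] = {}
    for seq, payload in arrivals:
        if seq < next_seq or seq in pending:
            continue  # already forwarded or already buffered: duplicate
        pending[seq] = payload
        while next_seq in pending:
            yield next_seq, pending.pop(next_seq)
            next_seq += 1


# Line A loses packet 3; line B delivers everything slightly later.
arrivals = [
    (1, b"t1"), (1, b"t1"),  # A then B
    (2, b"t2"), (2, b"t2"),
    (4, b"t4"),              # A skips ahead because 3 was lost
    (3, b"t3"),              # B fills the gap
    (4, b"t4"),              # B's copy of 4 is a duplicate
]
merged = list(arbitrate(arrivals))
```

In this example the merged output is the complete sequence 1 to 4, even though neither line delivered it cleanly on its own.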
Pre-trade risk management is another area where FPGAs can be used, for real-time limit checking. Although only limited real-time checks can be performed, given that this process operates at nanosecond speeds, trades can be blocked based on thresholds and policies set by a supervisory system, to prevent over-exposure to a particular instrument or asset class, for example. Additionally, the FPGA can act as a ‘cow catcher,’ detecting and preventing trading anomalies such as erroneous or disruptive trades, before they impact the market.
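The kind of threshold-based limit check described above might look like the following in a software model. This is a minimal sketch, not a compliant risk engine: the `RiskLimits` and `PreTradeRiskGate` names, the two limits shown and the assumption that an accepted order immediately updates the position are all simplifications made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class RiskLimits:
    """Per-instrument thresholds, as set by a supervisory system (hypothetical)."""
    max_order_qty: int  # largest single order allowed
    max_position: int   # largest absolute net position allowed


@dataclass
class PreTradeRiskGate:
    limits: Dict[str, RiskLimits]
    positions: Dict[str, int] = field(default_factory=dict)

    def check(self, symbol: str, side: str, qty: int) -> bool:
        """Return True if the order may pass, False if it must be blocked."""
        lim = self.limits.get(symbol)
        if lim is None or side not in ("BUY", "SELL"):
            return False  # unknown instrument or malformed order: block
        if qty <= 0 or qty > lim.max_order_qty:
            return False  # order size outside the permitted range
        signed_qty = qty if side == "BUY" else -qty
        projected = self.positions.get(symbol, 0) + signed_qty
        if abs(projected) > lim.max_position:
            return False  # would over-expose the firm to this instrument
        # Simplification: treat the accepted order as filled immediately.
        self.positions[symbol] = projected
        return True


gate = PreTradeRiskGate({"ABC": RiskLimits(max_order_qty=100, max_position=150)})
first = gate.check("ABC", "BUY", 100)      # within both limits
second = gate.check("ABC", "BUY", 100)     # would take the position to 200
oversized = gate.check("ABC", "SELL", 101) # exceeds the single-order limit
```

Each check is a handful of comparisons against preloaded thresholds, which is why this style of control lends itself to in-line execution in hardware.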
Firms don’t necessarily have to program such functions themselves. Specialist vendors now offer commercially available IP blocks for FPGAs that satisfy the pre-trade risk requirements of various exchanges and regulations, including SEC Rule 15c3-5 in the US and MiFID II in Europe.
Price/performance considerations
For all of the processes described above, software-based solutions for HFT generally provide a lower cost option than dedicated hardware and offer more flexibility in terms of programming, but they also come with lower performance. Even highly optimized software solutions, such as those using kernel bypass, typically incur latencies of around three to five microseconds due to the time required to access the server CPU and perform the necessary processing. Additionally, software solutions are heavy on resources, resulting in variable latency and poor scalability under increasing loads. While such solutions are adequate for the majority of market segments – the world of algorithmic trading is in fact dominated by software-based solutions – they certainly cannot match the speed of hardware.
At the other end of the scale, for firms considering a hardware-based approach, ASICs offer extremely high performance but are also the most rigid and expensive to deploy. ASIC development is a highly complex and costly process, and once an ASIC has been designed and manufactured, it is fixed and cannot be re-programmed; ASICs are custom-built for specific tasks or applications, and their circuitry is hardwired during the manufacturing process, making them immutable. While some top HFT firms are willing to bear such costs and rigidity for the marginal advantage they bring, only a few such firms exist.
FPGA-based solutions fall somewhere in the middle. From a performance perspective, FPGA solutions offer deterministic latency, which is typically challenging to achieve in software, and, as the Alveo accelerator demonstrates, can now achieve performance at levels that previously required ASIC solutions for certain tasks, such as serial I/O and Ethernet processing.
Although FPGA devices and the associated development teams and hardware engineers can prove more costly than smart NICs and software engineers, programming or re-programming an FPGA is a far more expedient proposition than manufacturing or re-manufacturing an ASIC. FPGAs also offer much greater flexibility. Electronic trading is highly dynamic. Trading houses constantly develop new algorithms, and unlike ASICs, FPGAs can be re-programmed multiple times to cater to this. Additionally, FPGAs benefit from a wide ecosystem of vendors providing IP and logic blocks to handle specific functions, allowing trading firms to purchase such components and concentrate their development efforts on the algorithms themselves.
Advantages of the AMD Alveo Accelerator Cards
All of this brings us back to the AMD Alveo ULL accelerator cards.
“The real innovation within the Alveo ultra-low latency accelerator and slimmer UL3422 card lies in its new transceiver and hardened silicon with AMD proprietary IP to handle high-speed data input, such as data coming from an exchange over an Ethernet connection, for example. The faster this data is processed into the custom logic of the FPGA chip, the lower the latency for tick-to-trade operations,” says Girish Malipeddi, Director of Product Management and Marketing at AMD, in conversation with TradingTech Insight. “GT stands for gigabit transceiver, which provides high-speed serial I/O on the device itself. The key difference with the Alveo ultra-low latency accelerator is that it features a newer, lower latency ‘GTF’ transceiver2, whereas our previous devices – which most existing customers in this space are using – have the earlier ‘GTY’ transceiver.”
He continues: “The Alveo UL3422 accelerator’s ultra-low latency reduces the transceiver latency from 16 nanoseconds to approximately 2.34 nanoseconds2. Moreover, the standard Ethernet PCS and MAC IP are integrated into the transceiver and clocked at 1.2 GHz, which is not feasible on the FPGA fabric itself. This contrasts with previous devices using the GTY transceiver, where the MAC and PCS layer would need to be built in soft logic on the FPGA itself and clocked at around 600 MHz. On the Alveo ultra-low latency card, having the MAC and PCS within the transceiver pin of the chip enables ASIC-like performance, making it exceptionally efficient.”
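To put the figures quoted above in perspective, the short Python sketch below works through the arithmetic. The helper names are hypothetical, and the inputs are simply the numbers stated in the quote (1.2 GHz versus around 600 MHz, and 16 ns versus approximately 2.34 ns); nothing here is additional measured data.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Duration of one clock cycle, in nanoseconds."""
    return 1e9 / frequency_hz


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0


# Clock rates quoted for the hardened PCS/MAC and for soft logic on the fabric.
hardened_period_ns = clock_period_ns(1.2e9)  # roughly 0.83 ns per cycle
fabric_period_ns = clock_period_ns(600e6)    # roughly 1.67 ns per cycle

# Transceiver latency figures quoted for the earlier GTY and the newer GTF.
saved_ns = 16.0 - 2.34                               # roughly 13.66 ns
transceiver_reduction = reduction_pct(16.0, 2.34)    # roughly 85.4 per cent
```

Doubling the clock rate halves the time each pipeline stage takes, which is one reason a hardened block clocked at 1.2 GHz can complete the same protocol work in less time than soft logic at around 600 MHz.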
For customers already using AMD FPGA solutions, transitioning to the ultra-low latency Alveo card is a straightforward evolution, says Malipeddi. “The design and deployment method is pretty much the same as with other FPGA-based cards,” he says. “This means FPGA users achieve high performance without disrupting their design and implementation methodology. Additionally, re-spinning an ASIC adds cost, whereas an FPGA is reprogrammable, making updates cost-efficient vs. manufacturing a new ASIC-based solution.”
Building on the success of the larger AMD Alveo UL3524 accelerator card, AMD released the Alveo UL3422 card in early October. Designed as a slimmer, cost-efficient solution that maintains the core latency benefits of the Alveo UL3524, the Alveo UL3422 accelerator card introduces several key differences. With a smaller form factor (full height, half length, compared with the full height, three-quarters length Alveo UL3524 card), the new Alveo UL3422 accelerator card is compatible with a much wider range of servers, a critical consideration for firms looking to maximize their available space in colocation facilities.
While the Alveo UL3524 is geared for scenarios that require maximum connectivity, with 32 Ethernet ports and 32 expansion slots, the Alveo UL3422 accelerator card offers 16 Ethernet ports and 16 expansion slots, making it ideal for applications that do not demand the same level of port density, such as certain layer 1 cross-connect use cases.
Levelling the playing field
For latency-dependent trading firms that already utilise FPGA hardware acceleration in their trading technology stack – or are considering doing so – creating a future-ready environment should be an essential consideration, advises Malipeddi.
“For FPGA-based trading to be competitive, it is essential to use cutting-edge technology with longevity,” he says. “The Alveo UL3422 card represents next-generation technology. As a result, firms will be able to maximize its use time. The Alveo UL3422 card levels the playing field, and has the same horsepower in the engine as the Alveo UL3524 accelerator card, regardless of the size. That means smaller players can compete with the bigger players, based on the quality of their algorithms.”
With the addition of the Alveo UL3422 card, the AMD Alveo portfolio offers a broad range of options that cater to customers’ different scales of deployment and requirements, from top-tier high-frequency trading firms to smaller players entering the market. The Alveo UL3422 card provides a low-cost entry point while still delivering cutting-edge latency performance, making it accessible to firms of all sizes that want to stay competitive in the ultra-low latency trading space.
Looking forward, as further latency improvements become increasingly marginal and harder to achieve, the introduction of products like the Alveo UL3422 card shows how AMD is pushing the boundaries of FPGA technology to meet diverse market needs. This evolution could signal a shift in the market, as more firms, regardless of size, can afford to leverage cutting-edge hardware acceleration to stay competitive. Will the growing emphasis on AI and machine learning, coupled with a more inclusive hardware ecosystem, pave the way for new trading models, such as AI/ML-supervised algorithmic trading? Only time will tell.
This article is sponsored by AMD
1. The 2024 AMD world record for latency is based on third-party testing commissioned by AMD and Exegy and carried out by Strategic Technology Analysis Center, LLC (STAC®) in April 2024, using the STAC-T0 benchmark to test the AMD Alveo UL3524 accelerator card powered by the AMD Virtex UltraScale+ VU2P FPGA, running on the Exegy nxFramework and Exegy nxTCP-UDP-10g-ULL IP Core, in a Dell PowerEdge R7525 server with AMD EPYC 7313 processors. See https://stacresearch.com/news/AMD240422 for the full STAC report. AMD holds the previous world record for latency (2020): https://www.stacresearch.com/news/XLX200514. Stated results for the Alveo UL3524 accelerator have been extrapolated to the AMD Alveo UL3422 card, based on identical silicon and product features (ALV-20).