Bringing low latency to the world of big data is what ParStream – which recently raised $5.6 million in series A funding – has been working on for several years now, with some impressive results. We talked to the company’s CEO, Mike Hummel, to find out more about the company and its technology.
Q: First, can you describe how ParStream got started, and what business problem are you looking to solve?
A: We founded ParStream in 2008 after we had identified a lack of database technology enabling real-time big data applications for our customers. Four years ago, our company won a contract to build a search engine for a travel package offering. The application had to search through about seven billion data records against 20 parameters in less than 100 milliseconds. We tried a lot of different database technologies, including NoSQL, but nothing worked. That motivated us to develop our own technology.
Today ParStream enables customers across several industries to gain new insights from big data in real time, and it is currently used in marketing analytics, customer analytics, operations analytics, research and other scenarios.
Q: What are the benefits of choosing ParStream compared to a Hadoop approach?
A: ParStream is built for real-time analytics on big data while the data is continuously imported, which enables the user to act on big data with low latency. ParStream gives the flexibility of a full drill-down into billions of records; there is no need for cubes, materialised views, projections or any other form of pre-aggregation.
Newer technologies such as Google’s MapReduce, and its open-source derivative Hadoop, are able to decompose a query into many independent pieces, just as the ParStream software does. But MapReduce is better suited to batch-mode processing than to real-time analytics. ParStream’s customers had tried the MapReduce scheme and encountered those limitations. In fact, Google itself abandoned MapReduce for query-type searching.
Another benefit of ParStream is its SQL interface. Developers know SQL very well, and SQL is the natural interface for integrating analytic tools. ParStream can therefore be adopted easily and integrated into existing infrastructure.
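The kind of ad-hoc drill-down described above is plain SQL. As an illustration only (using Python’s built-in SQLite rather than ParStream, whose own SQL dialect and schema will differ, and with an invented `clicks` table), the point is that any SQL-speaking tool can aggregate on the fly without a pre-built cube:

```python
import sqlite3

# Hypothetical event table; ParStream's actual dialect and schema will differ.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE clicks (campaign TEXT, country TEXT, revenue REAL)")
cur.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [("spring", "DE", 1.2), ("spring", "US", 0.8), ("summer", "DE", 2.5)],
)

# A typical analytic drill-down: aggregate at query time, no pre-aggregation.
cur.execute(
    "SELECT campaign, country, SUM(revenue) "
    "FROM clicks GROUP BY campaign, country ORDER BY campaign, country"
)
rows = cur.fetchall()
print(rows)
```

Because the interface is standard SQL, existing BI and reporting tools can issue queries like this unchanged.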
Q: What are the principal technology approaches that you’re leveraging to perform analytics with low latency? And how do they complement one another?
A: As suggested by its name, the ParStream software performs parallel streaming of data structures. The technology is ideal for very large amounts of structured and semi-structured data – the database can have thousands of columns and billions of rows. The secret is to parallelise each query so that it can be processed simultaneously on many cores spread across multiple nodes. In a cluster environment, the data is sharded, i.e. stored on individual servers in a “shared nothing” environment. Because ParStream processes data locally on each node, there is very little data traffic between the servers. This is why ParStream’s performance can scale linearly with cluster size: doubling the processors or nodes doubles throughput.
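The shared-nothing scatter-gather pattern described above can be sketched in a few lines. This is a toy illustration, not ParStream’s actual code: each “shard” aggregates its own rows locally, and only the small partial results travel back to be merged, which is why inter-node traffic stays low. (Threads stand in for cluster nodes here.)

```python
from concurrent.futures import ThreadPoolExecutor

# Toy shards: each list represents the rows stored locally on one node.
shards = [
    [("DE", 10), ("US", 3)],   # shard on node 1
    [("DE", 7), ("FR", 5)],    # shard on node 2
    [("US", 2), ("DE", 1)],    # shard on node 3
]

def local_aggregate(rows):
    """Runs where the shard lives: sum values per key, purely locally."""
    partial = {}
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    return partial

# Scatter: every shard is processed in parallel, with no shared state.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(local_aggregate, shards))

# Gather: only the small per-shard partials are merged into the final answer.
totals = {}
for partial in partials:
    for key, value in partial.items():
        totals[key] = totals.get(key, 0) + value
print(totals)
```

Because each shard’s work is independent, adding nodes adds capacity, which is the intuition behind the linear scaling claim above.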
It’s not just about query parallelisation, though. ParStream’s real secret sauce is its index structure. We invented a unique indexing technology, the High Performance Compressed Index (HPCI), which allows ultra-fast searches in compressed bitmap indices. This unique approach has been gaining recognition for ParStream, including recently being named a “Cool Vendor” by Gartner Research.
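To show why bitmap indices suit multi-predicate analytic filters, here is a minimal uncompressed sketch using Python integers as bitsets. It is illustrative only: HPCI’s compression and format are proprietary, and the table and column names below are invented.

```python
# Minimal bitmap-index sketch (illustrative; not ParStream's HPCI).
rows = [
    {"country": "DE", "device": "mobile"},
    {"country": "US", "device": "desktop"},
    {"country": "DE", "device": "desktop"},
    {"country": "DE", "device": "mobile"},
]

# Build one bitmap per (column, value): bit i is set iff row i matches.
index = {}
for i, row in enumerate(rows):
    for column, value in row.items():
        index[(column, value)] = index.get((column, value), 0) | (1 << i)

# Answering "country = DE AND device = mobile" is a single bitwise AND,
# touching the compact bitmaps rather than scanning the rows themselves.
hits = index[("country", "DE")] & index[("device", "mobile")]
matching_rows = [i for i in range(len(rows)) if hits >> i & 1]
print(matching_rows)  # rows 0 and 3 match both predicates
```

Combining many such predicates stays cheap because each extra condition is just one more bitwise operation; compressing the bitmaps, as HPCI does, shrinks them further without giving up that property.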
Q: What part do GPUs play in your architecture?
A: From the start we have optimised ParStream for a massively parallel architecture, and we use bitmap indices. Both innovations bring a performance boost on modern CPUs, but GPUs benefit from this technology even more. The ParStream architecture gives our customers the advantage of using CPUs today with the opportunity to switch to GPUs later if they like. Today our customers prefer running ParStream on CPUs, in other words commodity hardware, because it is easier and cheaper to integrate into their data centre.
Q: What kind of performance is achievable using ParStream, and how does the deployment model – customer infrastructure, cloud, appliance – impact that?
A: ParStream delivers sub-second response times on billions of data records whilst continuously importing new data at very high speed.
We have built solutions with ParStream that range from real-time bidding, with average response times of seven milliseconds on the AWS cloud, up to live segmentation of web-click data, which needs multi-stage processing and executes in about two seconds on 10 billion records. We all know that query response time depends on query type, data structure, data volume, etc., but these response times are typical.
ParStream is a software-only product that is available for many Linux distributions. It can be operated on a single server, a dedicated server cluster or on virtualised infrastructure such as private or public clouds. Of course, dedicated infrastructure provides some performance gain over virtualised environments, but ParStream runs very well on virtualised environments like AWS as well.
Q: How has ParStream been adopted for financial markets applications?
A: The ability to act fast on the latest data is essential in financial markets. ParStream is made for real-time analytics with low latency. In other words, ParStream is built for the requirements of financial market applications, with no extra adaptation needed.
Q: What’s coming next from ParStream in terms of company growth, technology directions, products?
A: We are hiring at our locations in Palo Alto and in Cologne, Germany! So please give us a ring if you want to be part of the team exploring the limits of real-time analytics technology.
Several partnerships are going to be announced within the next month, which will show that ParStream is an integral part of the big data movement. ParStream will continue to focus on its database technology, so that integrated real-time solutions can be offered together with its partner network.
Regarding our technology, we want to make it easier to use and better integrated with the Hadoop and OLTP ecosystems. Furthermore, we are planning to build in some further secret sauce … psst.