HPCC Systems – a unit of LexisNexis – has been quietly building out a big data processing platform for several years, using it internally to power several of its parent’s services and applications. Now it is open-sourcing the platform, which is not based on MapReduce, and selling a commercial version to third parties. We got the details from Dr. Flavio Villanustre, vice president of infrastructure and products at the company.
Q: How did HPCC Systems get formed, and what is its relationship to LexisNexis?
A: HPCC Systems is an initiative by LexisNexis Risk Solutions to release its core distributed data-intensive technology platform under an Apache open source license. This HPCC Systems platform was developed by LexisNexis Risk Solutions, a leader in providing information that helps customers across all industries and government predict, assess and manage risk. LexisNexis Risk Solutions is a $1.4 billion division of Reed Elsevier, which is a leading publisher and information provider serving customers in over 100 countries with more than 30,000 employees worldwide.
As a leading information provider, LexisNexis has more than 35 years’ experience in managing big data, from publicly available information such as worldwide newspapers, magazines, articles, research, case law, legal regulations, periodicals, and journals – to public records such as bankruptcies, liens, judgments, real estate records – to other types of information. Today, LexisNexis has several petabytes of data that come in from about 20,000 different sources, including structured, semi-structured and unstructured data.
To manage, sort, link, and analyse billions of records in under a second, LexisNexis designed a data-intensive supercomputer built on its own high performance computing cluster (HPCC) platform, proven over the past 10 years with customers who need to sort through billions of records. Customers such as leading banks, insurance companies, utilities, law enforcement and federal government depend on LexisNexis technology and information solutions to help them make better decisions faster.
Q: What is the business mission for HPCC Systems?
A: The business mission of HPCC Systems is to raise awareness of the technology and keep it relevant, find new use cases, build a community of adopters, and expand the talent pool of developers who work with ECL (Enterprise Control Language), the platform’s open programming language.
Q: What is the technology that’s been developed – such as Thor and Roxie?
A: Designed to manage the most complex and data-intensive analytical problems, the HPCC Systems platform can process, analyse, and find links and associations in high volumes of complex data significantly faster and more accurately than alternative technology systems. HPCC Systems scales linearly from tens to thousands of nodes handling many petabytes, supporting millions or billions of transactions per day. HPCC Systems delivers, on a single platform, a single architecture and a single programming language for efficient processing and operation.
HPCC Systems consists of three main components:
ECL – At the core of HPCC Systems is the Enterprise Control Language (ECL), a declarative, data-centric programming language optimised for large-scale data management and query processing. It automatically manages workload distribution across all nodes. The language allows data analysts and data scientists to define all aspects of massive data transformations, such as joins, sorts and index builds.
Thor – the Data Refinery Cluster, designed to execute big data workflows, including extraction, loading, cleansing, linking and indexing. It is suitable for massive joins and merges, massive sorts and transformations, and any N² problem. Example: Thor could identify and catalog all the DNA in the oceans.
Roxie – the Rapid Data Delivery Cluster, which provides high-performance online query delivery for big data. Roxie uses highly optimised, distributed B-tree indexed data structures and has been built for highly available, highly concurrent use. A typical 10-node cluster can process thousands of concurrent requests and answer them in fractions of a second. Roxie allows indices to be built on data for efficient multi-user retrieval. It is suitable for high volumes of structured queries and full-text ranked Boolean search. Example: Roxie can help you find a specific fish in the ocean.
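The division of labour between the two clusters can be illustrated with a minimal, single-machine sketch in Python. This is not ECL and does not use any HPCC Systems API. The record layout and the `refine` and `lookup` functions are hypothetical, and a sorted list with binary search stands in for Roxie's distributed B-tree indices.

```python
from bisect import bisect_left

# Hypothetical raw records (surname, city); illustrative only, not HPCC data.
RAW = [
    ("smith", "Boston"),
    ("jones", "Miami"),
    ("smith", "Boston"),  # exact duplicate that the batch step should remove
    ("adams", "Denver"),
]


def refine(records):
    """Batch step (the role Thor plays): cleanse duplicates, then sort into an index."""
    return sorted(set(records))


def lookup(index, surname):
    """Query step (the role Roxie plays): binary-search the prebuilt index."""
    # A 1-tuple sorts before any 2-tuple sharing the same first element,
    # so this finds the first record whose surname is >= the requested one.
    start = bisect_left(index, (surname,))
    matches = []
    for record in index[start:]:
        if record[0] != surname:
            break
        matches.append(record)
    return matches


INDEX = refine(RAW)          # built once, offline
print(lookup(INDEX, "smith"))  # served many times, online
```

The point of the split is that the expensive work (deduplication, sorting, index building) happens once in the batch step, so each online query only pays for a search over the prepared index.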
Q: What does this technology provide that approaches like Hadoop do not?
A: Proven and in production for 10 years: HPCC Systems is a proven and complete platform that has been in production environments for over 10 years. It helped LexisNexis Risk Solutions Division scale to a $1.4 billion information solutions company. It was originally built to help LexisNexis with its own big data processing workflows, and was recently open-sourced for the reasons described above.
Complete and streamlined platform: HPCC Systems has two main cluster components (Thor and Roxie) and a high-level, succinct, highly productive and complete programming language (ECL). HPCC Systems runs on commodity off-the-shelf hardware and was built to help small development teams develop products using iterative agile strategies. This maximises efficiency and eliminates the need for large numbers of low-level Java developers.
Not Map Reduce: HPCC Systems is not based on the map-and-reduce process. Instead, it leverages transformative data graphs. Many complex data problems require a series of advanced functions to solve them. With HPCC Systems technology, complex data challenges can be represented naturally with a transformative data graph. The nodes of the data graph can be processed in parallel as distinct data flows.
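As a rough illustration of the data graph idea, the Python sketch below runs a small dependency graph of transformations and submits independent nodes to a thread pool at the same time. It is a toy under stated assumptions, not how HPCC Systems schedules work internally: the `GRAPH` layout, node names, sample data and `run_graph` function are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def load_people():
    return [("bob", 2), ("ann", 1)]          # (name, city_id), made-up data


def load_cities():
    return [(1, "Boston"), (2, "Miami")]     # (city_id, city), made-up data


def join(people, cities):
    city_by_id = dict(cities)
    return [(name, city_by_id[city_id]) for name, city_id in people]


def sort_by_city(joined):
    return sorted(joined, key=lambda row: row[1])


# Hypothetical graph: node name -> (names of upstream nodes, transformation).
GRAPH = {
    "people": ([], load_people),
    "cities": ([], load_cities),
    "joined": (["people", "cities"], join),
    "sorted": (["joined"], sort_by_city),
}


def run_graph(graph):
    """Run every node once its inputs exist; nodes ready together run in parallel."""
    results = {}
    pending = dict(graph)
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [name for name, (deps, _) in pending.items()
                     if all(dep in results for dep in deps)]
            if not ready:
                raise ValueError("cycle or missing dependency in graph")
            futures = {}
            for name in ready:
                deps, func = pending[name]
                futures[name] = pool.submit(func, *(results[d] for d in deps))
            for name, future in futures.items():
                results[name] = future.result()
                del pending[name]
    return results


print(run_graph(GRAPH)["sorted"])
```

Here `people` and `cities` have no dependencies, so they are submitted together, while `joined` and `sorted` wait for their inputs. The whole job is expressed as one graph rather than as a chain of separate map and reduce stages.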
Q: What does HPCC actually offer – to the open source community, and as a commercial product?
A: The Community Edition provides source code through a version control system, and periodic binary snapshots at important milestones, with support provided by other community members (as in any other open source project).
The Enterprise Edition is an enterprise-ready set of binary installable packages, together with additional documentation, support, indemnity, training and the binary-only modules normally required for more sophisticated data processing. There are annual license fees for the Enterprise Edition and its support. Additional software modules, value-added services, and a turnkey offering (hardware, software, and services) are also available.
Q: How widely used are your offerings? How is your technology being used in the financial services sector?
A: The HPCC Systems platform is used in two ways: through traditional LexisNexis data services and analytics products that sit on top of the platform, and as a stand-alone option.
For example, LexisNexis products for industries such as insurance, health care, government, financial services, telecommunications and retail leverage the HPCC Systems platform. For insurance, LexisNexis Risk Solutions helps insurers assess their risk and streamline the underwriting process in 99% of all U.S. auto insurance claims and more than 90% of all homeowner claims. LexisNexis C.L.U.E. Auto, the industry standard loss underwriting database for the U.S. auto insurance market, represents a 99.6 percent industry contribution.
In the financial sector, LexisNexis Risk Solutions helps 50 of the top 50 U.S. banks prevent crime, achieve regulatory compliance and mitigate business risk.
Retail customers use our tools to predict and prevent fraud, while health care professionals use them to help combat fraud, waste and abuse.
LexisNexis Risk Solutions assists 70% of local government and almost 80% of federal agencies in the U.S. to safeguard citizens and reduce financial losses. The company’s flagship product, Accurint (which also leverages the HPCC Systems platform), provides investigators the information and analysis they need to quickly and confidently work their cases.
As a stand-alone option, HPCC Systems is used by various companies and organisations to derive insight from big data. They include a large computer manufacturer, a leading business information company, a leading IT networking company, credit risk organisations, and data analytics companies such as Opera Solutions.
Q: What’s next in the development of HPCC – both in terms of technology and business?
A: As with most open source projects, the HPCC Systems initiative has a public and very rich roadmap covering a number of areas, ranging from incremental improvements to significant breakthroughs in the data-intensive computing arena. Some of the immediate projects are dedicated to extending current functionality and enhancing usability, while the more strategic initiatives involve applied research in hybrid multicore hardware technologies, integration with other systems, the introduction of even higher levels of abstraction through Domain Specific Languages for graph processing, and more.
From a business perspective, the HPCC Systems team has a multi-pronged approach: 1) Leverage the existing LexisNexis customer base on traditional products to up-sell and cross-sell HPCC Systems to enterprise customers who have a large or complex data challenge; 2) Cultivate a network of partners to leverage their reach into other markets, especially international markets; 3) Pursue new customers by increasing HPCC Systems’ presence in the technology and big data space.