By Amir Halfon, Senior Director for Technology, Capital Markets, Oracle Financial Services
In my last post I briefly touched on the subject of unstructured data and schema-less repositories. In this instalment, I’d like to focus on this topic in a bit more detail, and look at the Variety aspect of Big Data from several angles. Variety refers to various degrees of structure (or lack thereof) within the source data. And while much attention has been given to loosely-structured web data – whether sourced from the web itself (social media, etc.) or from web server logs – I’d like to turn to the topic of unstructured data within the firm’s firewall. I’d also like to focus on the challenge of linking diverse data with various levels of structure, rather than discussing the storage and analysis of unstructured data as a standalone problem.
Here are a couple of examples: Firms are under growing pressure to retain all interaction records that relate to a transaction: phone calls, emails, IM, etc. Recently, especially in light of the MF Global debacle, more attention has been given to the linkage between these records and the corresponding transactions handled by trade capture systems, order management systems, and the like.
There is a growing realization within the industry that regulations such as Dodd-Frank will require this linkage to be established, and readily available for on-demand reporting. Aside from regulatory compliance, interaction records can also be quite useful for rogue trading and other fraud detection analysis once they are effectively linked to transactional data.
OTC derivatives are another interesting example; bilateral contracts in essence, they contain critical information within their legal text, which has to be deciphered in order to become usable for analytics. Much regulatory attention has been given to some of these instruments (such as swaps), in an effort to standardize them and make them more transparent, but even as we’re moving toward a central counter party model, many derivatives remain quite obscure in terms of their core data elements. This is especially true when derivation relationships have to be traversed in order to get to a complete picture of risk exposure and other aggregate data.
These and other examples make a case for the challenge, as well as the importance of integrating structured and loosely/un-structured data. With that in mind, I’d like to discuss a few enabling technical strategies:
As I mentioned in my previous post, I do not see these technologies as orthogonal, but rather as a continuum of tools that can work in concert. Many customers using map reduce and other schema-less frameworks have been struggling with combining their outputs with structural data and analytics coming from the RDBMS side. And it’s becoming clear that rather than choosing one over the other, it is the integration of the relational and non-relational paradigms that provides the most powerful analytics by bringing together the best of both worlds.
There are several technologies that enable this integration; some of them fall into the traditional ETL category, while others take advantage of the processing power of map reduce frameworks like Hadoop to perform data transformation in-place rather doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the compute capabilities of engineered machines (mentioned in previous post), and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the over-arching principle is ‘real-time data integration’: reflecting data changes instantly in a data warehouse – whether originating from a map reduce job or from a transactional system – so that downstream analytics have an accurate, timely view of reality.
Linked Data and Semantics
Formally, this term refers to linking disparate data sets using Semantic technology. This powerful strategy has been getting a lot of traction within the biomedical industry, but is only now gaining momentum within financial services thanks to the efforts of the EDM Council and uptake from developers.
More broadly speaking though, the notion of pointing at external sources from within a data set has been around for quite a long time, and the ability to point to unstructured data (whether residing in the file system or some external source) is merely an extension of that. Moreover, the ability to store and process XML and XQuery natively within some RDBMSs allows the combination of different degrees of structure while searching and analyzing the underlying data.
Semantic Technology takes this a step further by providing a set of formalized xml-based standards for storage, querying and manipulation of data. Because of its heritage as part of the Semantic Web vision, it is not typically associated with Big Data discussions, which in my mind is a big miss… While most ‘NoSQL’ technologies fall into the categories of key value stores, graph, or document databases, the Semantic RDF triple store provides a different alternative. It is not relational in the traditional sense, but still maintains relationships between data elements, including external ones, and does so in a flexible, extensible fashion.
A record in an RDF store is comprised of a ‘triple’, consisting of subject-predicate-object, each expressed as a URI. This is just as flexible as a document-oriented database or a key-value store in that it does not impose a relational schema on the data, allowing the addition of new elements without structural modifications to the store. Since records are essentially URIs, they can point to each other as well as to external sources. And the underlying system can resolve references using ‘reasoning’ – inferring new triples from the existing records using a set of rules. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while at the same time offering a more expressive way to model data than a key value store. For a detailed overview of Semantic technology, take a look at http://www.w3.org/standards/semanticweb
Lastly, one of most powerful aspects of Semantic technology came from the world of linguistics and Natural Language Processing. Referred to as Entity Extraction, it is the ability to extract pre-defined data elements (names, quantities, prices, etc.) from unstructured text. The applicability of this technology to unstructured data analytics seems quite obvious: Whether it’s sentiment analysis of web data, risk analysis of OTC contracts, or fraud detection analysis of email records, Entity Extraction and NLP provide a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deep analytics by bringing these worlds closer together.