The knowledge platform for the financial technology industry

A-Team Insight Blogs

Opinion: Size Isn’t Everything – Handling Big Data Variety

By Amir Halfon, Senior Director for Technology, Capital Markets, Oracle Financial Services

In my last post I briefly touched on the subject of unstructured data and schema-less repositories. In this instalment, I’d like to focus on this topic in a bit more detail, and look at the Variety aspect of Big Data from several angles. Variety refers to various degrees of structure (or lack thereof) within the source data. And while much attention has been given to loosely-structured web data – whether sourced from the web itself (social media, etc.) or from web server logs – I’d like to turn to the topic of unstructured data within the firm’s firewall. I’d also like to focus on the challenge of linking diverse data with various levels of structure, rather than discussing the storage and analysis of unstructured data as a standalone problem.

Here are a couple of examples: Firms are under growing pressure to retain all interaction records that relate to a transaction: phone calls, emails, IM, etc. Recently, especially in light of the MF Global debacle, more attention has been given to the linkage between these records and the corresponding transactions handled by trade capture systems, order management systems, and the like.

There is a growing realization within the industry that regulations such as Dodd-Frank will require this linkage to be established, and readily available for on-demand reporting. Aside from regulatory compliance, interaction records can also be quite useful for rogue trading and other fraud detection analysis once they are effectively linked to transactional data.

OTC derivatives are another interesting example; bilateral contracts in essence, they contain critical information within their legal text, which has to be deciphered in order to become usable for analytics. Much regulatory attention has been given to some of these instruments (such as swaps), in an effort to standardize them and make them more transparent, but even as we’re moving toward a central counterparty model, many derivatives remain quite obscure in terms of their core data elements. This is especially true when derivation relationships have to be traversed in order to get to a complete picture of risk exposure and other aggregate data.

These and other examples illustrate both the challenge and the importance of integrating structured data with loosely structured and unstructured data. With that in mind, I’d like to discuss a few enabling technical strategies:

SQL-NoSQL Integration

As I mentioned in my previous post, I do not see these technologies as mutually exclusive, but rather as a continuum of tools that can work in concert. Many customers using MapReduce and other schema-less frameworks have been struggling to combine their outputs with structured data and analytics coming from the RDBMS side. And it’s becoming clear that rather than choosing one over the other, it is the integration of the relational and non-relational paradigms that provides the most powerful analytics by bringing together the best of both worlds.
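As a minimal sketch of what such a combination could look like, the Python example below runs a toy map and reduce step over a few loosely structured interaction log lines, then loads the result into a relational table and joins it with trade records. The log format, the table names (`trades`, `mentions`) and the trade identifiers are all hypothetical, and an in-memory SQLite database merely stands in for the RDBMS side.

```python
import sqlite3
from collections import Counter

# Hypothetical loosely structured interaction log (format assumed for illustration).
LOG_LINES = [
    "2012-03-01 chat trader=alice mentions trade T100",
    "2012-03-01 email trader=bob mentions trade T101",
    "2012-03-02 phone trader=alice mentions trade T100",
]

def map_phase(line):
    """Emit (trade_id, 1) for each log line that mentions a trade."""
    tokens = line.split()
    if "trade" in tokens:
        idx = tokens.index("trade")
        if idx + 1 < len(tokens):
            yield tokens[idx + 1], 1

def reduce_phase(pairs):
    """Sum the emitted counts per trade id."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def join_with_trades(mention_counts):
    """Load the reduce output into a table and join it with relational trade data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (trade_id TEXT PRIMARY KEY, notional REAL)")
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?)",
        [("T100", 5e6), ("T101", 1e6), ("T102", 2e6)],
    )
    conn.execute("CREATE TABLE mentions (trade_id TEXT PRIMARY KEY, n INTEGER)")
    conn.executemany("INSERT INTO mentions VALUES (?, ?)", list(mention_counts.items()))
    rows = conn.execute(
        "SELECT t.trade_id, t.notional, COALESCE(m.n, 0) "
        "FROM trades t LEFT JOIN mentions m ON m.trade_id = t.trade_id "
        "ORDER BY t.trade_id"
    ).fetchall()
    conn.close()
    return rows

counts = reduce_phase(pair for line in LOG_LINES for pair in map_phase(line))
joined = join_with_trades(counts)
```

Here `joined` holds one row per trade with its notional and the number of interaction records that mention it, which is the kind of combined view neither side provides on its own.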

There are several technologies that enable this integration; some of them fall into the traditional ETL category, while others take advantage of the processing power of MapReduce frameworks like Hadoop to perform data transformation in place rather than doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the compute capabilities of engineered machines (mentioned in my previous post), and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the overarching principle is ‘real-time data integration’: reflecting data changes in a data warehouse as soon as they occur, whether they originate from a MapReduce job or from a transactional system, so that downstream analytics have an accurate, timely view of reality.
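The change data capture idea can be illustrated with a small Python sketch: a stream of change events is replayed against a warehouse table so that the target mirrors the source. The event format, the `trades_dw` table and the status values are assumptions made for illustration only; real CDC products read database logs and have their own formats.

```python
import sqlite3

# Hypothetical change events, as a CDC feed might deliver them (format assumed).
CHANGES = [
    {"op": "insert", "trade_id": "T100", "status": "NEW"},
    {"op": "insert", "trade_id": "T101", "status": "NEW"},
    {"op": "update", "trade_id": "T100", "status": "FILLED"},
    {"op": "delete", "trade_id": "T101"},
]

def apply_change(conn, change):
    """Apply a single change event to the warehouse table."""
    op = change["op"]
    if op in ("insert", "update"):
        # Upsert keeps the target row in step with the latest source state.
        conn.execute(
            "INSERT OR REPLACE INTO trades_dw (trade_id, status) VALUES (?, ?)",
            (change["trade_id"], change["status"]),
        )
    elif op == "delete":
        conn.execute("DELETE FROM trades_dw WHERE trade_id = ?", (change["trade_id"],))
    else:
        raise ValueError(f"unknown operation: {op}")

def replay(changes):
    """Replay all events in order and return the resulting warehouse rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades_dw (trade_id TEXT PRIMARY KEY, status TEXT)")
    for change in changes:
        apply_change(conn, change)
    rows = conn.execute(
        "SELECT trade_id, status FROM trades_dw ORDER BY trade_id"
    ).fetchall()
    conn.close()
    return rows

warehouse_rows = replay(CHANGES)
```

Because only the changes are shipped and applied at the target, no intermediate tier has to re-extract and re-transform the full data set.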

Linked Data and Semantics

Formally, this term refers to linking disparate data sets using Semantic technology. This powerful strategy has been getting a lot of traction within the biomedical industry, but is only now gaining momentum within financial services thanks to the efforts of the EDM Council and uptake from developers.

More broadly speaking though, the notion of pointing at external sources from within a data set has been around for quite a long time, and the ability to point to unstructured data (whether residing in the file system or some external source) is merely an extension of that. Moreover, the ability of some RDBMSs to store XML natively and query it with XQuery allows different degrees of structure to be combined while searching and analyzing the underlying data.

Semantic Technology takes this a step further by providing a set of formalized standards, such as RDF with its XML serialization, for storage, querying and manipulation of data. Because of its heritage as part of the Semantic Web vision, it is not typically associated with Big Data discussions, which in my mind is a big miss. While most ‘NoSQL’ technologies fall into the categories of key-value stores, graph databases, or document databases, the Semantic RDF triple store provides a different alternative. It is not relational in the traditional sense, but still maintains relationships between data elements, including external ones, and does so in a flexible, extensible fashion.

A record in an RDF store is a ‘triple’ consisting of subject, predicate and object, each typically expressed as a URI (the object may also be a literal value). This is just as flexible as a document-oriented database or a key-value store in that it does not impose a relational schema on the data, allowing the addition of new elements without structural modifications to the store. Since records are essentially made of URIs, they can point to each other as well as to external sources. And the underlying system can resolve references using ‘reasoning’: inferring new triples from the existing records using a set of rules. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while at the same time offering a more expressive way to model data than a key-value store. For a detailed overview of Semantic technology, take a look at http://www.w3.org/standards/semanticweb
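To make the triple and reasoning ideas concrete, here is a minimal Python sketch that uses plain tuples rather than a real RDF store. The `http://example.org/` namespace, the `derivedFrom` and `confirmation` predicates and the instrument names are hypothetical, and the single transitivity rule is only a stand-in for the much richer rule sets that actual reasoners support.

```python
EX = "http://example.org/"          # hypothetical namespace
DERIVED_FROM = EX + "derivedFrom"   # hypothetical predicate

# Each record is a (subject, predicate, object) triple of URIs.
triples = {
    (EX + "swaption1", DERIVED_FROM, EX + "swap1"),
    (EX + "swap1", DERIVED_FROM, EX + "libor3m"),
    # A triple can also point at an external, unstructured source.
    (EX + "swap1", EX + "confirmation", "file:///contracts/swap1.pdf"),
}

def infer_transitive(store, predicate):
    """Forward-chain one rule until nothing new is found:
    if (a p b) and (b p c) are present, add (a p c)."""
    inferred = set(store)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for s, p, o in inferred if p == predicate]
        for a, b in links:
            for b2, c in links:
                if b == b2 and (a, predicate, c) not in inferred:
                    inferred.add((a, predicate, c))
                    changed = True
    return inferred

def objects(store, subject, predicate):
    """Return all objects linked to a subject by a predicate, sorted."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)

closed = infer_transitive(triples, DERIVED_FROM)
underlyings = objects(closed, EX + "swaption1", DERIVED_FROM)
```

After inference, `underlyings` contains both the swap and the rate it is itself derived from, without any join logic having been written for that specific chain of relationships.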

Lastly, one of the most powerful aspects of Semantic technology comes from the world of linguistics and Natural Language Processing. Referred to as Entity Extraction, it is the ability to extract pre-defined data elements (names, quantities, prices, etc.) from unstructured text. The applicability of this technology to unstructured data analytics seems quite obvious: whether it’s sentiment analysis of web data, risk analysis of OTC contracts, or fraud detection analysis of email records, Entity Extraction and NLP provide a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deep analytics by bringing these worlds closer together.
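A heavily simplified Python sketch of this idea follows. It uses regular expressions in place of a real NLP pipeline, which would normally rely on trained models, and the trade id pattern, the amount notation, the sample email and the `TRADES` book are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns: trade ids like "T100", amounts like "USD 5m".
TRADE_ID = re.compile(r"\bT\d{3,}\b")
AMOUNT = re.compile(r"\b(USD|EUR|GBP)\s?(\d+(?:\.\d+)?)\s?(k|m|bn)?\b", re.IGNORECASE)
SCALE = {"": 1.0, "k": 1e3, "m": 1e6, "bn": 1e9}

# Hypothetical transactional data the extracted entities are linked to.
TRADES = {"T100": {"desk": "rates", "notional": 5_000_000.0}}

def extract_entities(text):
    """Pull pre-defined elements (trade ids, currency amounts) out of free text."""
    amounts = [
        (ccy.upper(), float(number) * SCALE[suffix.lower()])
        for ccy, number, suffix in AMOUNT.findall(text)
    ]
    return {"trade_ids": TRADE_ID.findall(text), "amounts": amounts}

def link_to_trades(entities, trades):
    """Split extracted trade ids into those found in the book and those not."""
    matched = [tid for tid in entities["trade_ids"] if tid in trades]
    unmatched = [tid for tid in entities["trade_ids"] if tid not in trades]
    return matched, unmatched

email = "Please confirm T100, notional USD 5m, before noon. Also check EUR 250k on T101."
entities = extract_entities(email)
matched, unmatched = link_to_trades(entities, TRADES)
```

An email that references a trade id with no counterpart in the trade capture system, as `T101` does here, is exactly the sort of discrepancy a compliance or fraud detection process might want surfaced.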
