The leading knowledge platform for the financial technology industry
The leading knowledge platform for the financial technology industry

A-Team Insight Blogs

Opinion: Size Isn’t Everything – Handling Big Data Variety

By Amir Halfon, Senior Director for Technology, Capital Markets, Oracle Financial Services

In my last post I briefly touched on the subject of unstructured data and schema-less repositories. In this instalment, I’d like to focus on this topic in a bit more detail, and look at the Variety aspect of Big Data from several angles. Variety refers to various degrees of structure (or lack thereof) within the source data. And while much attention has been given to loosely-structured web data – whether sourced from the web itself (social media, etc.) or from web server logs – I’d like to turn to the topic of unstructured data within the firm’s firewall. I’d also like to focus on the challenge of linking diverse data with various levels of structure, rather than discussing the storage and analysis of unstructured data as a standalone problem.

Here are a couple of examples: Firms are under growing pressure to retain all interaction records that relate to a transaction: phone calls, emails, IM, etc. Recently, especially in light of the MF Global debacle, more attention has been given to the linkage between these records and the corresponding transactions handled by trade capture systems, order management systems, and the like.

There is a growing realization within the industry that regulations such as Dodd-Frank will require this linkage to be established, and readily available for on-demand reporting. Aside from regulatory compliance, interaction records can also be quite useful for rogue trading and other fraud detection analysis once they are effectively linked to transactional data.

OTC derivatives are another interesting example; bilateral contracts in essence, they contain critical information within their legal text, which has to be deciphered in order to become usable for analytics. Much regulatory attention has been given to some of these instruments (such as swaps), in an effort to standardize them and make them more transparent, but even as we’re moving toward a central counter party model, many derivatives remain quite obscure in terms of their core data elements. This is especially true when derivation relationships have to be traversed in order to get to a complete picture of risk exposure and other aggregate data.

These and other examples make a case for the challenge, as well as the importance of integrating structured and loosely/un-structured data. With that in mind, I’d like to discuss a few enabling technical strategies:

SQL-NoSQL Integration

As I mentioned in my previous post, I do not see these technologies as orthogonal, but rather as a continuum of tools that can work in concert. Many customers using map reduce and other schema-less frameworks have been struggling with combining their outputs with structural data and analytics coming from the RDBMS side. And it’s becoming clear that rather than choosing one over the other, it is the integration of the relational and non-relational paradigms that provides the most powerful analytics by bringing together the best of both worlds.

There are several technologies that enable this integration; some of them fall into the traditional ETL category, while others take advantage of the processing power of map reduce frameworks like Hadoop to perform data transformation in-place rather doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the compute capabilities of engineered machines (mentioned in previous post), and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the over-arching principle is ‘real-time data integration’: reflecting data changes instantly in a data warehouse – whether originating from a map reduce job or from a transactional system – so that downstream analytics have an accurate, timely view of reality.

Linked Data and Semantics

Formally, this term refers to linking disparate data sets using Semantic technology. This powerful strategy has been getting a lot of traction within the biomedical industry, but is only now gaining momentum within financial services thanks to the efforts of the EDM Council and uptake from developers.

More broadly speaking though, the notion of pointing at external sources from within a data set has been around for quite a long time, and the ability to point to unstructured data (whether residing in the file system or some external source) is merely an extension of that. Moreover, the ability to store and process XML and XQuery natively within some RDBMSs allows the combination of different degrees of structure while searching and analyzing the underlying data.

Semantic Technology takes this a step further by providing a set of formalized xml-based standards for storage, querying and manipulation of data. Because of its heritage as part of the Semantic Web vision, it is not typically associated with Big Data discussions, which in my mind is a big miss… While most ‘NoSQL’ technologies fall into the categories of key value stores, graph, or document databases, the Semantic RDF triple store provides a different alternative. It is not relational in the traditional sense, but still maintains relationships between data elements, including external ones, and does so in a flexible, extensible fashion.

A record in an RDF store is comprised of a ‘triple’, consisting of subject-predicate-object, each expressed as a URI. This is just as flexible as a document-oriented database or a key-value store in that it does not impose a relational schema on the data, allowing the addition of new elements without structural modifications to the store. Since records are essentially URIs, they can point to each other as well as to external sources. And the underlying system can resolve references using ‘reasoning’ – inferring new triples from the existing records using a set of rules. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while at the same time offering a more expressive way to model data than a key value store. For a detailed overview of Semantic technology, take a look at

Lastly, one of most powerful aspects of Semantic technology came from the world of linguistics and Natural Language Processing. Referred to as Entity Extraction, it is the ability to extract pre-defined data elements (names, quantities, prices, etc.) from unstructured text. The applicability of this technology to unstructured data analytics seems quite obvious: Whether it’s sentiment analysis of web data, risk analysis of OTC contracts, or fraud detection analysis of email records, Entity Extraction and NLP provide a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deep analytics by bringing these worlds closer together.

Related content


Upcoming Webinar: The post-Brexit UK sanctions regime – how to stay safe and compliant

Date: 11 March 2021 Time: 10:00am ET / 3:00pm London / 4:00pm CET Duration: 50 minutes When the Brexit transition period came to an end on 31 December 2020, a new sanctions regime was introduced in the UK under legislation set out in the Sanctions and Anti-Money Laundering Act 2018 (aka the Sanctions Act). The...


Two Golden Sources: How Will the New FCA FIRDS System Hold Up Post-Brexit?

At the end of 2020 the UK’s Financial Conduct Authority (FCA) lost access to the EU’s MiFID database: instead launching its own FIRDS system. It’s a big change, and it could have a significant impact on the way firms are able to manage their transaction reporting requirements. The new FCA FIRDS database was built to...


RegTech Summit Virtual

The RegTech Summit Virtual is a global online event that will be held in June 2021 with an exceptional guest speaker line up of RegTech practitioners, regulators, start-ups and solution providers to collaborate and discuss innovative and effective approaches for building a better regulatory environment.


Regulatory Data Handbook 2020/2021 – Eighth Edition

This eighth edition of A-Team Group’s Regulatory Data Handbook is a ‘must-have’ for capital markets participants during this period of unprecedented change. Available free of charge, it profiles every regulation that impacts capital markets data management practices giving you: A detailed overview of each regulation with key dates, data and data management implications, links to...