The leading knowledge platform for the financial technology industry

A-Team Insight Blogs

Opinion: Size Isn’t Everything – Handling Big Data Variety

By Amir Halfon, Senior Director for Technology, Capital Markets, Oracle Financial Services

In my last post I briefly touched on the subject of unstructured data and schema-less repositories. In this instalment, I’d like to focus on this topic in a bit more detail, and look at the Variety aspect of Big Data from several angles. Variety refers to various degrees of structure (or lack thereof) within the source data. And while much attention has been given to loosely-structured web data – whether sourced from the web itself (social media, etc.) or from web server logs – I’d like to turn to the topic of unstructured data within the firm’s firewall. I’d also like to focus on the challenge of linking diverse data with various levels of structure, rather than discussing the storage and analysis of unstructured data as a standalone problem.

Here are a couple of examples: Firms are under growing pressure to retain all interaction records that relate to a transaction: phone calls, emails, IM, etc. Recently, especially in light of the MF Global debacle, more attention has been given to the linkage between these records and the corresponding transactions handled by trade capture systems, order management systems, and the like.

There is a growing realization within the industry that regulations such as Dodd-Frank will require this linkage to be established, and readily available for on-demand reporting. Aside from regulatory compliance, interaction records can also be quite useful for rogue trading and other fraud detection analysis once they are effectively linked to transactional data.

OTC derivatives are another interesting example. Bilateral contracts in essence, they contain critical information within their legal text, which has to be deciphered before it becomes usable for analytics. Much regulatory attention has been given to some of these instruments (such as swaps) in an effort to standardize them and make them more transparent, but even as we move toward a central counterparty model, many derivatives remain quite obscure in terms of their core data elements. This is especially true when derivation relationships have to be traversed in order to get a complete picture of risk exposure and other aggregate data.

These and other examples illustrate both the challenge and the importance of integrating structured data with loosely structured and unstructured data. With that in mind, I’d like to discuss a few enabling technical strategies:

SQL-NoSQL Integration

As I mentioned in my previous post, I do not see these technologies as orthogonal, but rather as a continuum of tools that can work in concert. Many customers using MapReduce and other schema-less frameworks have been struggling to combine their outputs with structured data and analytics coming from the RDBMS side. And it’s becoming clear that rather than choosing one over the other, it is the integration of the relational and non-relational paradigms that provides the most powerful analytics by bringing together the best of both worlds.
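As a rough illustration of that kind of integration, the sketch below runs a toy map and reduce step over a few chat messages and then joins the result with a relational table. It is a minimal sketch under stated assumptions: the message data, the `trades` and `message_counts` tables, and the in-memory SQLite database are all hypothetical stand-ins for a real MapReduce job and a real RDBMS.

```python
import sqlite3
from collections import defaultdict

# Hypothetical chat log records: (trader_id, message). Illustrative data only.
messages = [
    ("T1", "buy 100 XYZ now"),
    ("T2", "sell 50 ABC"),
    ("T1", "cancel the XYZ order"),
]

def map_phase(records):
    """Emit (trader_id, 1) for every message, as a mapper would."""
    for trader_id, _text in records:
        yield trader_id, 1

def reduce_phase(pairs):
    """Sum the emitted counts per key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

message_counts = reduce_phase(map_phase(messages))

# Relational side: a hypothetical trades table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, trader_id TEXT, notional REAL)"
)
conn.executemany(
    "INSERT INTO trades (trader_id, notional) VALUES (?, ?)",
    [("T1", 1000.0), ("T1", 2500.0), ("T2", 400.0)],
)

# Load the reducer output next to the relational data so that SQL can join them.
conn.execute(
    "CREATE TABLE message_counts (trader_id TEXT PRIMARY KEY, n_messages INTEGER)"
)
conn.executemany("INSERT INTO message_counts VALUES (?, ?)", message_counts.items())

rows = conn.execute(
    """
    SELECT t.trader_id, COUNT(*) AS n_trades, SUM(t.notional) AS total_notional,
           m.n_messages
    FROM trades t
    JOIN message_counts m ON m.trader_id = t.trader_id
    GROUP BY t.trader_id, m.n_messages
    ORDER BY t.trader_id
    """
).fetchall()

for row in rows:
    print(row)
```

The point of the sketch is only the shape of the flow: the schema-less side produces an aggregate, and the relational side treats that aggregate as one more table to join against.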

There are several technologies that enable this integration; some of them fall into the traditional ETL category, while others take advantage of the processing power of MapReduce frameworks like Hadoop to perform data transformation in-place rather than doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the compute capabilities of engineered machines (mentioned in my previous post), and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the over-arching principle is ‘real-time data integration’: reflecting data changes instantly in a data warehouse – whether originating from a MapReduce job or from a transactional system – so that downstream analytics have an accurate, timely view of reality.
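To make the change data capture idea a little more concrete, here is a minimal sketch that replays an ordered list of captured changes directly against a target table. Everything in it is an assumption made for illustration: the event format, the `positions` table, and the `apply_changes` helper are hypothetical, and real CDC tools read changes from the source database's log rather than from a Python list.

```python
import sqlite3

# Hypothetical change events captured from a source system, in commit order.
# Each event is (operation, primary key, new column values or None).
change_log = [
    ("insert", 1, {"symbol": "XYZ", "qty": 100}),
    ("insert", 2, {"symbol": "ABC", "qty": 50}),
    ("update", 1, {"symbol": "XYZ", "qty": 150}),
    ("delete", 2, None),
]

def apply_changes(conn, events):
    """Replay captured changes against the target table in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        for op, key, values in events:
            if op == "insert":
                conn.execute(
                    "INSERT INTO positions (id, symbol, qty) VALUES (?, ?, ?)",
                    (key, values["symbol"], values["qty"]),
                )
            elif op == "update":
                conn.execute(
                    "UPDATE positions SET symbol = ?, qty = ? WHERE id = ?",
                    (values["symbol"], values["qty"], key),
                )
            elif op == "delete":
                conn.execute("DELETE FROM positions WHERE id = ?", (key,))
            else:
                raise ValueError(f"unknown operation: {op}")

# Target side: an in-memory SQLite table standing in for the warehouse.
target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE positions (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER)"
)

apply_changes(target, change_log)
positions = target.execute(
    "SELECT id, symbol, qty FROM positions ORDER BY id"
).fetchall()
print(positions)
```

Because only the changes are shipped and applied, the target stays in step with the source without a full reload and without a separate transformation tier in between.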

Linked Data and Semantics

Formally, this term refers to linking disparate data sets using Semantic technology. This powerful strategy has been getting a lot of traction within the biomedical industry, but is only now gaining momentum within financial services thanks to the efforts of the EDM Council and uptake from developers.

More broadly speaking though, the notion of pointing at external sources from within a data set has been around for quite a long time, and the ability to point to unstructured data (whether residing in the file system or some external source) is merely an extension of that. Moreover, the ability to store and process XML and XQuery natively within some RDBMSs allows the combination of different degrees of structure while searching and analyzing the underlying data.

Semantic Technology takes this a step further by providing a set of formalized XML-based standards for storage, querying and manipulation of data. Because of its heritage as part of the Semantic Web vision, it is not typically associated with Big Data discussions, which in my mind is a big miss… While most ‘NoSQL’ technologies fall into the categories of key-value stores, graph databases, or document databases, the Semantic RDF triple store provides a different alternative. It is not relational in the traditional sense, but still maintains relationships between data elements, including external ones, and does so in a flexible, extensible fashion.

A record in an RDF store is a ‘triple’ consisting of subject, predicate and object, each typically expressed as a URI (the object may also be a literal value). This is just as flexible as a document-oriented database or a key-value store in that it does not impose a relational schema on the data, allowing the addition of new elements without structural modifications to the store. Since records are essentially URIs, they can point to each other as well as to external sources. And the underlying system can resolve references using ‘reasoning’ – inferring new triples from the existing records using a set of rules. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while at the same time offering a more expressive way to model data than a key-value store. For a detailed overview of Semantic technology, take a look at
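The triple model and rule-based inference described above can be sketched in a few lines. This is a toy in-memory illustration, not a real RDF store: the `http://example.org/` URIs, the `derivedFrom` and `counterparty` predicates, and the helper functions are all hypothetical, and the single rule shown (transitivity) is just one simple example of reasoning.

```python
# Triples are (subject, predicate, object) tuples of URIs.
# All URIs below are hypothetical and live under example.org.
EX = "http://example.org/"

triples = {
    (EX + "swap42", EX + "derivedFrom", EX + "bondA"),
    (EX + "bondA", EX + "derivedFrom", EX + "loanPool7"),
    (EX + "swap42", EX + "counterparty", EX + "bankX"),
}

def infer_transitive(store, predicate):
    """Apply one rule until nothing new appears:
    (a p b) and (b p c) implies (a p c)."""
    inferred = set(store)
    while True:
        links = [(s, o) for s, p, o in inferred if p == predicate]
        new = {
            (a, predicate, d)
            for a, b in links
            for c, d in links
            if b == c
        } - inferred
        if not new:
            return inferred
        inferred |= new

def objects(store, subject, predicate):
    """Return every object linked to the subject by the predicate, sorted."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)

closed = infer_transitive(triples, EX + "derivedFrom")
exposure = objects(closed, EX + "swap42", EX + "derivedFrom")
print(exposure)
```

Adding a new predicate here means adding a new tuple, not altering a schema, and the inferred triple linking `swap42` to `loanPool7` plays the role that a multi-table join would play in a relational design.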

Lastly, one of the most powerful aspects of Semantic technology came from the world of linguistics and Natural Language Processing. Referred to as Entity Extraction, it is the ability to extract pre-defined data elements (names, quantities, prices, etc.) from unstructured text. The applicability of this technology to unstructured data analytics seems quite obvious: Whether it’s sentiment analysis of web data, risk analysis of OTC contracts, or fraud detection analysis of email records, Entity Extraction and NLP provide a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deep analytics by bringing these worlds closer together.
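As a very simplified illustration of extracting pre-defined elements from text and linking them to transactional data, the sketch below uses regular expressions as a stand-in for a real NLP pipeline. The patterns, the `TRD-` trade identifier format, the sample email, and the `trades` lookup are all hypothetical; production entity extraction relies on trained linguistic models rather than hand-written rules like these.

```python
import re

# Hypothetical patterns for a few pre-defined entity types.
PATTERNS = {
    "trade_id": re.compile(r"\bTRD-\d+\b"),
    "quantity": re.compile(r"\b(\d[\d,]*)\s+(?:shares|contracts)\b"),
    "price": re.compile(r"\$(\d+(?:\.\d+)?)"),
}

def extract_entities(text):
    """Return every match per entity type, using group 1 when one is defined."""
    result = {}
    for name, pattern in PATTERNS.items():
        result[name] = [
            m.group(1) if pattern.groups else m.group(0)
            for m in pattern.finditer(text)
        ]
    return result

# Unstructured side: a hypothetical email body.
email = "Re TRD-1001: please fill 5,000 shares at $12.50 before the close."
entities = extract_entities(email)

# Structured side: a hypothetical trade capture record keyed by trade id.
trades = {"TRD-1001": {"symbol": "XYZ", "qty": 5000, "price": 12.50}}

# Link the email to the transactions it mentions.
linked = [(tid, trades[tid]) for tid in entities["trade_id"] if tid in trades]

print(entities)
print(linked)
```

Once the extracted identifier is matched to a trade record, the email can be analyzed alongside the transaction it refers to, which is the kind of linkage the earlier compliance examples call for.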
