Data Management Insight Knowledge Hub
In a nutshell: Data lineage traces data from source to destination, noting every move the data makes and taking into account any changes to the data during its journey for full traceability. It is critical to regulatory compliance and offers numerous business and operational benefits.
Read on in our Knowledge Hub ‘Everything you need to know’ section to understand the full details of what data lineage is all about, who it impacts, the key requirements, the technical and data challenges it presents, and the outlook.
You can also take a look at all the latest content we have related to data lineage. And you can see a listing of key vendors delivering solutions to this data and technological challenge.
Everything you need to know about: Data Lineage
What is data lineage?
Data lineage covers the lifecycle of data, from its origins, through to what happens to the data when it is processed by different systems, and where it moves from and to over time. It can be applied to most types of data and systems, and is particularly valuable in complex, high volume data environments. It is also a key element of data governance, providing an understanding of where data comes from, how systems process the data, how it is used and by whom.
The importance of data lineage has escalated in recent years as regulators demand full transparency and audit trails of the data behind all trading decisions.
But over time firms have come to understand the value and benefits it can deliver. Acceleration of automation has also advanced use cases. Beyond compliance, extensive data lineage can provide operational transparency and reduce risk and costs. From a business perspective, data lineage can improve data quality and allow the business to make better decisions and spot new business opportunities and strategies.
Data lineage is often represented visually to show the movement of data from source to destination, changes to the data and how it is transformed by processes or users as it moves from one system to another across an enterprise, and how it splits or converges after each move. Visualisation can demonstrate data lineage at different levels of granularity, perhaps at a high level providing data lineage that shows which systems data interacts with before it reaches its destination. As granularity increases, it becomes possible to provide detail around the particular data, such as its attributes and the quality of the data at specific points in the lineage.
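The two levels of granularity described above can be pictured as a directed graph. The following is a minimal sketch only: the system and attribute names are hypothetical, and commercial lineage tools use far richer models. It records attribute-level hops and rolls them up into the high-level view of which systems feed which.

```python
# Attribute-level lineage: each edge is (source, target), written "system.attribute".
# All names are hypothetical, for illustration only.
ATTRIBUTE_EDGES = [
    ("trading.price", "risk.market_value"),
    ("trading.quantity", "risk.market_value"),
    ("refdata.currency", "risk.market_value"),
    ("risk.market_value", "report.exposure"),
]


def system_of(node: str) -> str:
    """Return the system part of a 'system.attribute' node name."""
    return node.split(".", 1)[0]


def system_level(edges):
    """Roll attribute-level edges up to distinct system-to-system edges."""
    rolled = {(system_of(src), system_of(dst)) for src, dst in edges}
    # Drop hops that stay inside one system; they are invisible at this level.
    return sorted(edge for edge in rolled if edge[0] != edge[1])


SYSTEM_EDGES = system_level(ATTRIBUTE_EDGES)
for src, dst in SYSTEM_EDGES:
    print(f"{src} -> {dst}")
```

Keeping the detailed edges as the single record and deriving the coarse view from them means the two levels cannot drift apart.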
By building a picture of how data flows through an organisation and is transformed from source to destination, it is possible to create complete audit trails of data points, an aspect of lineage that has become increasingly necessary to meeting regulatory requirements and ensuring data integrity for the business.
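As a rough illustration of an audit trail for a single data point, the sketch below assumes each processing step is logged as an event. The `LineageEvent` schema, the identifiers and the values are all hypothetical; the point is only that an ordered list of steps can be reconstructed from source to destination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    """One recorded step in the life of a data point (hypothetical schema)."""
    data_id: str
    sequence: int   # order in which the step happened
    system: str     # where the step happened
    action: str     # what was done, e.g. "sourced" or "reported"
    value: str      # value of the data point after the step


def audit_trail(events, data_id):
    """Return the ordered steps for one data point, from source to destination."""
    steps = [e for e in events if e.data_id == data_id]
    return sorted(steps, key=lambda e: e.sequence)


# Hypothetical events, deliberately stored out of order.
EVENTS = [
    LineageEvent("px-001", 2, "pricing", "converted to EUR", "92.10"),
    LineageEvent("px-001", 1, "vendor_feed", "sourced", "100.00"),
    LineageEvent("px-002", 1, "vendor_feed", "sourced", "55.00"),
    LineageEvent("px-001", 3, "risk_report", "reported", "92.10"),
]

TRAIL = audit_trail(EVENTS, "px-001")
for step in TRAIL:
    print(step.sequence, step.system, step.action, step.value)
```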
The necessary scope of data lineage can be determined by regulatory requirements, enterprise data management strategy, data impact and critical data elements. It is not necessary to boil the ocean – instead, best practice identifies regulatory requirements and business processes to which the application of data lineage is beneficial.
Who is involved in data lineage?
Reflecting the regulatory compliance and business use cases of data lineage, related job titles include:
- Business analyst
- Business intelligence developer
- Compliance officer
- Data analyst
- Data architect
- Data governance analyst
- Data modeller
- Data quality analyst
- Solutions architect
Regulations driving adoption
Data lineage was initially implemented by financial institutions to track data across individual data management projects. It rose to prominence and became part of the regulatory landscape following the implementation of BCBS 239 in January 2016, a Basel Committee on Banking Supervision (BCBS) rule designed to improve data aggregation and reporting across financial markets, as well as accountability for data.
These requirements were the early drivers of improved data lineage, which has since been reinforced by a number of regulations that require firms to implement lineage to demonstrate exactly how they came to the results published in regulatory reports. Data lineage allows firms to not only prove the validity of report entries, but also take a proactive approach to identifying and fixing any gaps in required data.
Regulatory requirement: Basel Committee on Banking Supervision rule 239 (BCBS 239) came into force on January 1, 2016 and is designed to improve risk data aggregation and reporting. It is based on 14 principles that underpin accurate risk aggregation and reporting in normal times and times of crisis. To achieve compliance, banks must capture risk data across the organisation, establish consistent data taxonomies, and store data in a way that makes it easily accessible and straightforward to understand.
Data lineage response: Data lineage must be implemented to support risk aggregation, data accuracy and reporting, and conversely, to ensure risk data can be traced back to its origin and risk reports can be defended.
Regulatory requirement: General Data Protection Regulation (GDPR) is an EU data privacy regulation that came into force on May 25, 2018. It is designed to harmonise data privacy laws across Europe and protect EU citizens’ data privacy. The requirements of GDPR include gaining explicit consent to process personal data, giving data subjects access to their personal data, ensuring data portability, notifying authorities and individuals of data breaches, and giving individuals the right to be forgotten.
Data lineage response: Firms subject to GDPR are dependent on data lineage to track data and provide transparency about where it is and how it is used. Data lineage provides firms with the ability to demonstrate compliance with the regulation and, from a data subject’s perspective, supports access to personal data and the execution of other rights such as the right to be forgotten.
Regulatory requirement: Markets in Financial Instruments Directive II (MiFID II) is a principles-based directive issued by the EU. It took effect on January 3, 2018, and aims to increase transparency across Europe’s financial markets and ensure investor protection. The demand for reference and market data for both pre- and post-trade transparency, including trade reporting and transaction reporting, is unprecedented, leading to data management challenges including sourcing required data, reporting in near real-time, and uploading reference and market data to MiFID II mechanisms including Approved Publication Arrangements (APAs) and Approved Reporting Mechanisms (ARMs).
Data lineage response: MiFID II operations can benefit from data lineage in a number of ways. Lineage can be used to identify any gaps in trade reporting data, and any similarities across numerous regulatory reporting obligations. It can also be used to map MiFID II reporting data from source systems to APAs and ARMs and vice versa.
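The gap identification described above can be sketched as a comparison between the fields a report needs and the fields lineage has mapped to a source. The field and source names below are hypothetical and are not taken from any MiFID II reporting schema; this is an illustration of the idea rather than a compliance check.

```python
# Fields a report needs, and the source attribute each one is mapped to.
# Field and source names are hypothetical, not taken from any regulatory schema.
REQUIRED_FIELDS = ["instrument_id", "price", "quantity", "venue", "trade_time"]

FIELD_SOURCES = {
    "instrument_id": "refdata.isin",
    "price": "trading.exec_price",
    "quantity": "trading.exec_qty",
    "trade_time": None,  # listed in the mapping, but no source identified yet
}


def reporting_gaps(required, sources):
    """Return required fields with no known source, in report order."""
    return [field for field in required if not sources.get(field)]


GAPS = reporting_gaps(REQUIRED_FIELDS, FIELD_SOURCES)
print(GAPS)
```

Here `venue` is flagged because it has no mapping at all, and `trade_time` because its mapping is empty.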
Regulatory requirement: The Comprehensive Capital Analysis and Review (CCAR) is an annual exercise carried out by the Federal Reserve to assess whether the largest bank holding companies (BHCs) operating in the US have sufficient capital to continue operations throughout times of economic and financial stress, and have robust, forward-looking capital planning processes that account for their unique risks. From a data management perspective, CCAR requires data sourcing, analytics and risk data aggregation for stress tests designed to assess the capital adequacy of BHCs and for regulatory reporting purposes.
Data lineage response: CCAR requires attribute level data lineage to track data from source to destination and ensure the validity and veracity of capital plans. Data lineage can also be used to identify any data gaps in reporting and highlight any data quality issues.
Regulatory requirement: Fundamental Review of the Trading Book (FRTB) regulation will take effect in 2022. It is a response to the 2008 financial crisis, which exposed fundamental weaknesses in the design of the trading book regime, and focuses on a revised internal model approach to market risk and capital requirements, a revised standardised approach, a shift from value at risk to an expected shortfall measure of risk, incorporation of the risk of market illiquidity, and reduced scope for arbitrage between banking and trading books.
The data management challenges of the regulation are significant and include data sourcing, facilitating capital calculations, and gathering historical data as well as real price observations for executed trades or committed quotes to meet requirements around non-modellable risk factors (NMRFs) and the linked risk factor eligibility test.
Data lineage response: To satisfy the demands of FRTB, data lineage may be needed to track historical data and trade data aggregation required for the risk factor eligibility test of NMRFs, essentially the provision of at least 24 real price observations of the value of the risk factor over the previous 12 months.
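The observation count mentioned above lends itself to a simple check. The sketch below covers only the condition stated in the text, at least 24 real price observations over the previous 12 months; the full eligibility test contains further conditions that are not modelled here. Approximating 12 months as 365 days and counting at most one observation per date are assumptions, and the observation dates are invented.

```python
from datetime import date, timedelta

MIN_OBSERVATIONS = 24  # threshold stated in the text above


def count_recent_observations(observation_dates, as_of):
    """Count distinct observation dates in the 12 months up to and including as_of."""
    # Assumption: the 12-month window is approximated as 365 days.
    window_start = as_of - timedelta(days=365)
    return len({d for d in observation_dates if window_start < d <= as_of})


def meets_observation_count(observation_dates, as_of):
    """Check only the 'at least 24 observations in 12 months' condition."""
    return count_recent_observations(observation_dates, as_of) >= MIN_OBSERVATIONS


AS_OF = date(2019, 6, 28)
# Hypothetical data: one observation every 14 days, going back 30 observations.
FORTNIGHTLY = [AS_OF - timedelta(days=14 * i) for i in range(30)]
# Hypothetical data: one observation every 30 days, going back 30 observations.
MONTHLY = [AS_OF - timedelta(days=30 * i) for i in range(30)]

print(meets_observation_count(FORTNIGHTLY, AS_OF))
print(meets_observation_count(MONTHLY, AS_OF))
```

Lineage matters here because each counted observation needs to be traceable to an executed trade or committed quote if the result is challenged.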
Business use cases of data lineage
Beyond regulatory compliance, data lineage offers business benefits, but it must be approached as a long-term activity rather than a point solution if it is to provide ongoing value.
Among the business benefits of successful data lineage implementation are:
Understanding data: It may sound simple, but understanding data that is used and stored across an organisation can be very difficult when it includes masses of internal data, several sources of external data, data silos and data in different formats. By applying data lineage, it is possible to gain a greater understanding of the data a company holds, where it is, what it is used for, its value and potential. With a good understanding of data, it is also possible to assign responsibility for data ownership to individuals, departments or lines of business within the organisation.
Improved business decisions: By providing access to accurate, trusted data quickly and efficiently, data lineage allows the business to make smarter, faster and better informed decisions. Decisions can be made more proactively where there is data lineage and defended on the basis of being able to determine the exact data underlying a decision.
Identifying business opportunities: Using data lineage to gain a better understanding of data, and to visualise data and processes, can provide new business opportunities, such as the potential to create new products by combining certain data and processes, or the possibility of finding an external partner to upscale and commercialise specific datasets.
Data discovery: Data lineage provides the ability to decide what data is important and find the right data quickly. This is crucial to business decisions and can help firms remain competitive and identify new business opportunities.
Improved analytics: More reliable and better quality data that is understood and easily accessible supports improved analytics and the knock-on effect of better business decisions.
Increased efficiency: By eliminating duplicated data and redundant data and systems, and providing a clear view of data and how it changes and moves around an organisation, data lineage can provide increased operational efficiency that can support both cost reduction and business needs for fast access to trusted data.
Impact assessment: Data lineage can be used to study how changes to IT systems or business processes could affect specific products or reports downstream.
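Impact assessment of this kind amounts to walking the lineage downstream from the thing being changed. The sketch below assumes lineage is held as a simple mapping of each system to the systems and reports it feeds; the names are hypothetical.

```python
from collections import deque

# Hypothetical flows: each key feeds the systems or reports listed against it.
FEEDS = {
    "order_system": ["trade_store"],
    "trade_store": ["risk_engine", "transaction_report"],
    "refdata_hub": ["risk_engine"],
    "risk_engine": ["capital_report"],
}


def downstream_of(start, feeds):
    """Return everything reachable downstream of start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in feeds.get(current, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


IMPACTED = downstream_of("trade_store", FEEDS)
print(IMPACTED)
```

The `seen` set keeps the walk safe if flows converge or loop back on themselves.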
Cost reduction: Data lineage offers a number of ways to reduce costs. The need to review data across an organisation as a first step of data lineage allows firms to identify and delete any duplicated data, focus on data silos and decide their fate, and discover unused data that can be eradicated and redundant systems that can be switched off. This will optimise a firm’s data footprint and reduce the costs of data management.
A better understanding of data also provides an opportunity to review licensed data, which may be licensed more than once within an organisation or barely used. Firms can then avoid the penalties of using unlicensed data and renegotiate licences with data vendors to make external data provision more cost effective.
Data lineage and data discovery can also support new projects at lower cost as some required data and processes can be identified and reused.
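One simple way to spot the duplicated data mentioned above is to fingerprint the content of each dataset and group identical fingerprints. This is a sketch under strong assumptions: datasets are small enough to hash in memory, duplicates are exact copies, and the dataset names are invented. Near-duplicates would need a different technique.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Hash a dataset's rows so identical content gives an identical digest."""
    digest = hashlib.sha256()
    # Sort rows so that row order does not affect the fingerprint.
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def find_duplicates(datasets):
    """Group dataset names by content fingerprint; keep groups with 2+ members."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical datasets held in different places.
DATASETS = {
    "risk/counterparties": [("C1", "Acme"), ("C2", "Birch")],
    "finance/counterparties_copy": [("C2", "Birch"), ("C1", "Acme")],
    "ops/instruments": [("I1", "Bond"), ("I2", "Swap")],
}

DUPLICATES = find_duplicates(DATASETS)
print(DUPLICATES)
```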
Business intelligence and change management: The ability of data lineage to expose an organisation’s data lends itself well to business intelligence and change management. What-if analyses can be made using existing data and processes, starter projects can be undertaken to predict outcomes of change, and favourable projects can be developed quickly using existing and new resources. Rather than calling on IT to build new systems from scratch, the business can discover how new commercial concepts could work before investing in systems.
Data ownership: By clarifying where data is, who uses it and what for, data lineage can allow data ownership to be handed over to relevant individuals, departments or lines of business that can best exploit the data.
What are the challenges of implementing data lineage?
The challenges of implementing data lineage fall into three buckets – operations, technology and data management.
The operational challenges of data lineage start with winning management buy-in and funding for a solution that can be expensive, requires significant human input, and offers only a modicum of advantage in early implementation.
The best approach here is to educate management and start small. Decide whether a pilot project is going to provide insight into business opportunities or achieve an element of regulatory compliance, prioritise the most important and relevant data, scope the project carefully, and identify stakeholders that should be involved.
In the first instance, it may be useful to assess where required data comes from manually and create baseline data lineage before considering automation. It is also important to make sure the pilot project is scalable for other data sources or areas of the organisation before making a business case for lineage.
Proving the concept of data lineage and demonstrating quick wins to the business should, hopefully, be enough to start the journey towards a larger data lineage programme spanning part or all of the organisation.
The technology challenges of data lineage arise from growing numbers of regulations with overlapping requirements, and from smarter auditors and regulators asking for responses to questions on demand. Technology innovation adds to the challenge, with cloud-based applications and services, big data systems, machine learning, artificial intelligence and natural language processing technologies creating complex infrastructure. Data can be managed in new and interesting ways, but keeping track of it and ensuring it can be trusted is increasingly difficult.
At the heart of addressing these challenges is the selection of a solution, or solutions, to support an organisation’s data lineage. Questions to consider include: how much lineage is already in place; to what extent will manual lineage be necessary; how will lineage be documented; how will it need to be scaled; how will impact assessment be managed; what is the long-term aim for automation; which areas of the organisation will be covered and at what level in terms of technical and business lineage; how will data lineage be sustained; what skills are required; and how much will it cost?
There are no catch-all answers to these questions and few organisations will find answers to all of them in one solution, leading most firms to implement a combination of in-house systems and vendor solutions.
Whatever the selected solution, however, it will not provide value in isolation. It is important to consider how data lineage and its metadata will integrate with the rest of an organisation’s business metadata as this will provide rich data and the ability to slice and dice the data. Data lineage also needs to run alongside an organisation’s systems development lifecycle plan to ensure it is maintained as technologies are changed.
And, of course, scalable and flexible technology is essential, not only to master growing volumes of existing data types, but also to embrace additional datasets, alternative data, regulatory change and new regulatory requirements.
Implementing automated data lineage is a complex data management task that can involve huge volumes of data, multiple legacy systems, mountains of spreadsheets, siloed data, uncharted data flows, mixed data formats, and the need to create metadata to describe the data.
Early considerations include identifying all the data across an organisation, assessing data quality and bringing manual processes into an automated lineage framework wherever possible.
An inventory of data can start the process of identifying which data is important to the business and should be part of a data lineage programme, which data can be left as is, and which data can be scrapped. Challenges here include mining outsourced and black box data, which can be difficult, if not impossible, to capture.
As well as identifying data that can be scrapped, the initial data inventory can uncover redundant systems that can be switched off, reducing the operations burden and the cost of systems infrastructure.
As data lineage is built out, data quality must be constantly monitored to facilitate lineage that is fit for purpose. Data quality can be addressed separately to data lineage, perhaps using the concept of a ‘data quality firewall’ based on a data management platform that enforces data policies and ensures data quality controls are executed before data is input to systems. Alternatively, it can be addressed within a data lineage framework using rules, controls and alerts.
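The rules, controls and alerts described above might look something like the following in miniature. This is an illustrative sketch, not a description of any particular platform: the rules and record fields are hypothetical, and a real data quality firewall would draw its policies from a governed source rather than hard-coding them.

```python
# Each rule is a (name, check) pair; a check returns True when the record passes.
# Rules and record fields are hypothetical examples of data policies.
RULES = [
    ("price is positive",
     lambda r: isinstance(r.get("price"), (int, float)) and r["price"] > 0),
    ("currency is 3 letters",
     lambda r: isinstance(r.get("currency"), str) and len(r["currency"]) == 3),
    ("instrument_id present",
     lambda r: bool(r.get("instrument_id"))),
]


def firewall(records, rules):
    """Split records into accepted ones and alerts listing the rules they failed."""
    accepted, alerts = [], []
    for record in records:
        failed = [name for name, check in rules if not check(record)]
        if failed:
            alerts.append((record, failed))
        else:
            accepted.append(record)
    return accepted, alerts


RECORDS = [
    {"instrument_id": "X1", "price": 101.5, "currency": "EUR"},
    {"instrument_id": "", "price": -3, "currency": "EURO"},
]

ACCEPTED, ALERTS = firewall(RECORDS, RULES)
print(len(ACCEPTED), "accepted")
for record, failed in ALERTS:
    print("rejected:", failed)
```

Only accepted records would be passed on to downstream systems; the alerts give data stewards the specific rules each rejected record broke.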
While most data lineage projects start as in-house manual developments responding to a specific requirement, an increasingly regulated environment, growing volumes of data and the need to provide fast access to business data are driving automation, in many cases based on a combination of in-house and vendor solutions.
A typical data lineage automation solution includes functionality that captures and documents data flows, such as a flow of financial instruments, from the data source to its final destination, perhaps a regulatory or internal report. Drilldown functionality allows particular points in the lineage to be inspected more closely, while traceability and audit ensure it is possible to track a piece of data through its journey across an organisation and verify its accuracy. Filtering capabilities allow users to filter for different data categories, such as reference data or trade data, and understand the data’s lineage and attributes.
Another technology facet of data lineage is visualisation, which can provide a real-time view of data moving through processes and systems, improve the understanding of data, highlight any defects in data flows, and visualise the impact of any changes to data and systems. Documentation is managed dynamically to reflect these changes in lineage.
Automation can also capture business logic and/or metadata that can be stored in a repository and used to create source to target data lineage, eliminate duplicated or redundant data, and provide business and technical users with the ability to locate, understand, and manage information that supports business operations.
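To make the idea of a metadata repository concrete, the sketch below assumes each target attribute is stored with its direct inputs and the business logic applied to them. Source to target lineage is then derived by following inputs back until nothing further is recorded. The attribute names and logic strings are hypothetical, and the recursion assumes the recorded mappings contain no cycles.

```python
# A minimal metadata repository: each target attribute lists its direct inputs
# and the business logic applied. All names and rules are hypothetical.
REPOSITORY = {
    "report.exposure_eur": {
        "inputs": ["risk.exposure", "refdata.fx_rate"],
        "logic": "exposure * fx_rate",
    },
    "risk.exposure": {
        "inputs": ["trading.price", "trading.quantity"],
        "logic": "price * quantity",
    },
}


def original_sources(target, repository):
    """Follow inputs back from target to attributes with no recorded inputs."""
    entry = repository.get(target)
    if entry is None:
        return {target}  # nothing recorded upstream, so treat it as a source
    sources = set()
    for parent in entry["inputs"]:
        # Assumption: mappings are acyclic, so this recursion terminates.
        sources |= original_sources(parent, repository)
    return sources


SOURCES = sorted(original_sources("report.exposure_eur", REPOSITORY))
print(SOURCES)
```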
These types of automated solutions offer many benefits, including the ability to trace data errors, identify discrepancies, control access to information and model what would happen if a new process or department were added to the business. They can also reduce time spent on validating data accuracy and put trusted information in the hands of decision makers.
Vendor solutions provide these types of functionality. There may be slight differences in underlying technologies, scope and potential for automation, but the key difference between vendor solutions is delivery, with some vendors providing cloud-based solutions that can be up and running quickly, and others offering enterprise software solutions that need to be implemented and maintained in-house.
Going forward, data lineage is likely to follow the steady flow of data, applications and analytics into the cloud environment, extensive automation will become the norm, and the goal of zero-gap data lineage will be within reach.
ASG Technologies Group provides more than 3,000 global organizations with a modern approach to Digital Transformation. ASG is the only solutions provider for both Information Management and IT Systems. ASG’s Information Management solutions enable companies to find, understand, govern and deliver information of any kind, from any source through its lifecycle. The IT Systems Management solutions empower companies to support digital initiatives, operate IT infrastructure more efficiently and reduce the cost of managing IT systems landscapes.
MarkLogic is an operational and transactional Enterprise NoSQL database platform trusted by global organizations to integrate their most critical data. Designed to integrate data from silos better, faster, and with less cost, MarkLogic can help integrate data and build a 360-degree view up to four times faster than if using a traditional database.
3d innovations (3di) – Data lineage for data compliance and licensing solutions
AxiomSL – Data capture and visualisation of data sources, data flows and business logic
Bloomberg – Solutions based on the Financial Instrument Global Identifier (FIGI)
Cambridge Semantics – Automatic capture of schema and statistical metadata describing data sources
Collibra – Interactive data lineage diagrams
Compact Solutions – Metadata integration platform providing data lineage
Datum – Metadata management for use cases including General Data Protection Regulation (GDPR)
Dremio – Data lineage to support analytics
Erwin – Web-based solution mapping data elements to sources
Global IDs – Data lineage layer that maps columns and tables to establish data flow
IBM – Metadata based data and business lineage
Informatica – Data lineage based on a machine learning enterprise data catalogue
Manta – Documents data lineage as it crunches programming code and provides an interactive map
Octopai – Automated cross-platform metadata management and data lineage
Smartlogic – Data lineage based on a semantic AI platform
Solidatus – Visualised data lineage based on metadata management
Talend – Cloud-based open source and enterprise lineage solutions
Trifacta – Data wrangling column-based solution