Data lineage has become a daily demand for various stakeholders. However, the concept remains unclear or unknown to many. I have been working with data lineage for six years. The most common questions business users ask about it are:
- What is data lineage?
- Why do I need it?
- How can I apply it in my daily practice?
In this blog post, I will briefly answer these questions.
Question 1: What is data lineage?
In the common view, data lineage describes how data is transformed along data chains, from the point where the data originates to the point where it is consumed. A data chain in this context is a set of applications. However, this definition comes with several challenges.
The first challenge is that there is no single, unambiguous definition, even within the data management community. There are several reasons for that. Data lineage can be documented at different abstraction levels. Data management professionals often think about data lineage at the physical level of databases, but it can also be documented at higher abstraction levels. Data lineage also serves the needs of different stakeholder groups, and different business users require it at different abstraction levels. So people may use the same term, “data lineage,” but mean different things.
The second challenge is that several other concepts have similar definitions. These concepts include data chain, data value chain, integration architecture, and data flow. In this case, people use different terms but mean the same thing.
The third challenge is that data lineage comes in various types. Here is a simple example. Data management and IT professionals understand data lineage as a description of a data transformation process, while business users think of it as a description of changes in the data itself. Data lineage also varies by documentation method, for example, descriptive versus automated. Another classification distinguishes the direction of documentation: vertical or horizontal. Different stakeholders need different types of data lineage.
The essence of the answer is: “always state your definition of data lineage before you start talking about it.”
Question 2: Why do I need it?
Different stakeholder groups need data lineage for different reasons. Let’s take three groups of professionals as an example: business, data management, and information technology (IT).
Business users from finance, risk, and compliance departments need data lineage for several reasons. First, data lineage assists in complying with regulations. Personal data protection regulations, for example, affect businesses worldwide. Data lineage makes data processing transparent in this case: it allows tracing the origin and destination of personal data and the transformations applied to it. Data lineage is also crucial for finance professionals, as it helps to answer audit questions about where data came from and how it was transformed. Finally, data lineage assists in optimizing business processes and application architecture, which reduces operational risks and costs.
Data management professionals can’t carry out many data management initiatives without knowing data lineage. These initiatives include, but are not limited to, master and reference data management, metadata management, and data quality. Let’s take a data quality initiative as an example. There are at least two areas where knowledge of data lineage is a “must” condition. The first is root-cause analysis of data quality issues. The second is building data quality checks. Data lineage enables all of these initiatives. So a company should have data lineage in place before starting them, or at least start the data lineage initiative in parallel.
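To make the root-cause analysis case concrete, here is a minimal sketch in Python of how documented lineage could be walked backwards from a dataset with a quality issue to everything that feeds it. The dataset names, the transformations, and the `trace_upstream` helper are all hypothetical and chosen only for illustration; real lineage would come from a metadata repository or a lineage tool.

```python
from collections import defaultdict

# Hypothetical lineage records: (source, target, transformation applied).
LINEAGE = [
    ("crm.customers", "staging.customers", "deduplicate records"),
    ("erp.invoices", "staging.invoices", "convert currency to EUR"),
    ("staging.customers", "dwh.revenue_per_customer", "join on customer_id"),
    ("staging.invoices", "dwh.revenue_per_customer", "sum invoice amounts"),
    ("dwh.revenue_per_customer", "report.quarterly_revenue", "aggregate by quarter"),
]


def trace_upstream(dataset, edges):
    """Return every (source, target, transformation) step that feeds `dataset`."""
    # Index the lineage records by their target so we can walk backwards.
    incoming = defaultdict(list)
    for source, target, transformation in edges:
        incoming[target].append((source, target, transformation))

    steps, seen, stack = [], set(), [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # Already visited; also guards against cycles.
        seen.add(current)
        for step in incoming[current]:
            steps.append(step)
            stack.append(step[0])  # Continue from the source of this step.
    return steps


# Suppose the quarterly revenue report shows a data quality issue.
for source, target, transformation in trace_upstream("report.quarterly_revenue", LINEAGE):
    print(f"{target} <- {source}: {transformation}")
```

Each printed step is a candidate location for the root cause, and each is also a natural place to put a data quality check.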
IT professionals need data lineage to carry out migration or DevOps projects. Optimizing data and application landscapes is also impossible without knowing data lineage. So data lineage is a “must” condition for these initiatives as well.
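For a migration, the typical question runs in the opposite direction: which applications are affected if one application changes? The sketch below, in Python, shows one possible way to answer it from application-level lineage. The application names and the `downstream_impact` helper are assumptions made up for this example, not part of any specific tool.

```python
from collections import deque

# Hypothetical application-level lineage:
# each application maps to the applications that consume its data.
FEEDS = {
    "legacy_billing": ["data_warehouse", "tax_reporting"],
    "data_warehouse": ["finance_dashboard", "risk_model"],
    "tax_reporting": [],
    "finance_dashboard": [],
    "risk_model": ["regulatory_report"],
    "regulatory_report": [],
}


def downstream_impact(application, feeds):
    """List applications that directly or indirectly consume data from `application`."""
    impacted = []
    seen = {application}  # Never report the starting application itself.
    queue = deque(feeds.get(application, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # Skip applications reached via more than one path.
        seen.add(current)
        impacted.append(current)
        queue.extend(feeds.get(current, []))
    return impacted


# Which applications need attention if legacy_billing is migrated?
print(downstream_impact("legacy_billing", FEEDS))
```

The result is ordered breadth-first, so direct consumers of the migrated application appear before indirect ones.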
The essence of the answer is: “business users should clearly identify the purpose for which they will use data lineage and clarify their requirements at the early stages of a data lineage initiative.”
Question 3: How can I apply data lineage in my daily practice?
This is the most challenging question to answer. If you think that documenting data lineage is the most difficult step, you are wrong. Of course, data lineage documentation is a time- and resource-consuming initiative. However, once data lineage is ready, a company faces the most significant challenge: bringing data lineage outcomes into “business as usual” operations. For business users to apply data lineage in their daily practice, it must meet their needs and requirements. Furthermore, one big challenge is difficult to foresee at the beginning of a data lineage initiative: validating the quality of data lineage. Validating data lineage takes significantly more time than documenting it. The challenging question is, “Who is accountable and responsible for data lineage validation?” The answer is not straightforward, as it depends on multiple factors, such as the abstraction level and the documentation method. In my opinion, the business users who are supposed to use data lineage should be accountable for its validation, while data management and IT professionals should assist and perform the validation.
The essence of the answer is: “to be effectively used, data lineage should meet the needs and requirements of targeted business stakeholders for whom it was originally developed.”
For more answers to questions about data lineage, you can consult my book “Data Lineage from a Business Perspective.”