With the rise of predictive analytics, obtaining and maintaining good-quality data has become imperative. Maintaining data quality manually is time-consuming, cumbersome, and often inadequate: manual processes are full of gaps and poorly maintained, and the resulting poor-quality data can have a long-lasting impact on business analytics and ultimately hamper an organization’s data science and analytics teams. On top of that, there may be concerns about the data quality checks executed at the technology level itself. How do we ensure that the technical team covers all data quality dimensions? Is there a way to know that the tests they run are aligned with the data governance initiatives and add value to governance and, in turn, to business processes? This article discusses how Benford’s law, combined with some specific measures, can be used to identify data quality issues.
So, what is Benford’s law?
Benford’s law is a lesser-known law of numbers that shows up almost everywhere an observer looks, from national populations to stock market volumes to universal physical constants. Benford’s law, or the Newcomb-Benford law (also known as the law of anomalous numbers or the first-digit law), is the observation that in large, naturally occurring sets of numbers, the first digit is likely to be small. In sets that obey the law, numbers starting with the digit 1 appear about 30% of the time, while numbers starting with the digit 9 appear less than 5% of the time. (Naturally occurring numbers here are numbers that are not assigned by a particular numbering scheme and are neither made up by a person nor produced by a random number generator.) So a set of naturally occurring numbers, free of human intervention, will more or less follow the curve shown in the graph, with the share of each leading digit falling steadily from 1 to 9.
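The percentages quoted above come from the standard statement of the first-digit law, under which the expected share of leading digit d is log10(1 + 1/d). A minimal Python sketch of that curve follows; the function name `benford_expected` is made up for illustration and is not taken from any library.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1 to 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


if __name__ == "__main__":
    # Print the reference curve that observed data is compared against.
    for d in range(1, 10):
        print(f"{d}: {benford_expected(d):.1%}")
```

The nine shares add up to 1, with the digit 1 at roughly 30% and the digit 9 just under 5%, which matches the figures mentioned above.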
What does it mean to us in terms of data quality and governance?
It means that if we have the numbers for all the measures from different applications within a business unit, a line of business, or even an entire organization, we can use them to check the standard of the data. If the data loosely fits the graph, the measures provided by the business owners and the governance team are good to go (which also implies that the numbers showing up in corporate reports are trustworthy). Any large deviation from the graph, however, is a matter of concern and probably requires further inspection. The best part of this approach is that the larger the data set being plotted, the better the results.
What would the measures be?
The graph can be plotted for any quantifiable measure: the number of core/critical elements per application or domain, the number of elements passing or failing the data quality checks per application or domain, the number of open issues, the number of applications successfully onboarded into the data strategy, the number of elements passing or failing each data quality dimension, and so on. In our case, since the subject is data quality, we will focus on the data quality dimensions, the pass/fail counts from quality checks, and the tickets raised based on those checks.
Based on this understanding, a simple experiment was run, as described below:
The experiment covered a set of 75 applications, and the measure considered was the “number of Quality Check passed elements per application”. The information below comes from sample data collected from external training sources:
From the above data, the number of occurrences of each first digit was counted, and the counts were plotted against Benford’s curve as below:
As the experiment above shows, the number of QC-passed data elements across these applications loosely follows the first-digit law, or Benford’s law. A simple graph can thus help identify whether the number of elements passing the QC checks is in line with the objective. Along similar lines, graphs can be drawn for the other measures mentioned earlier (every data quality dimension, failed QC checks).
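The tally-and-compare step used in the experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the sample counts are invented placeholders (not the 75-application data set), and the mean absolute deviation used here to put a number on "loosely fits" is one possible choice rather than the method used for the graph.

```python
import math
from collections import Counter


def first_digit(value: int) -> int:
    """Leading digit of a positive integer count."""
    if value <= 0:
        raise ValueError("Benford's law applies to positive values only")
    return int(str(value)[0])


def first_digit_shares(values):
    """Observed share of each leading digit 1-9 among the positive values."""
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("need at least one positive value")
    counts = Counter(first_digit(v) for v in positives)
    return {d: counts.get(d, 0) / len(positives) for d in range(1, 10)}


def mean_absolute_deviation(observed):
    """Average gap between observed shares and Benford's expected shares."""
    expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


if __name__ == "__main__":
    # Invented QC-passed counts per application, for illustration only.
    passed_counts = [12, 130, 18, 245, 31, 1020, 87, 160, 29, 410]
    shares = first_digit_shares(passed_counts)
    for d in range(1, 10):
        print(f"{d}: observed {shares[d]:.1%}, expected {math.log10(1 + 1 / d):.1%}")
    print(f"mean absolute deviation: {mean_absolute_deviation(shares):.3f}")
```

How large a deviation counts as a concern is a judgment call, and with only a few dozen applications the observed shares will be noisy, which is why larger data sets give better results.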
If the user then wants to find out which applications may be reporting improper numbers (overshooting or undershooting the requirements), further analysis can be done with a much smaller scope. For example, in this case the applications with “digit occurrence=8” may have inappropriately passed some data elements that should ideally have failed (or vice versa), or may have falsified the numbers reported to management. We can therefore check applications 30, 45 and 63 for discrepancies in the data quality checks that were passed in QC. Similarly, the checks can be performed for each data quality dimension within a business unit, LOB or organization.
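The drill-down described above, going from an over-represented first digit back to the applications that reported it, could look roughly like the sketch below. The function name, the `tolerance` parameter and its default value, and the sample application ids and counts are all assumptions made for illustration; none of them come from the experiment.

```python
import math
from collections import Counter


def leading_digit(value: int) -> int:
    """Leading digit of a positive integer count."""
    return int(str(value)[0])


def applications_behind_excess_digits(passed_by_app, tolerance=0.05):
    """Map each over-represented leading digit to the applications reporting it.

    passed_by_app: dict of application id -> QC-passed element count.
    tolerance: assumed margin by which an observed share may exceed
               Benford's expected share before the digit is flagged.
    """
    positive = {app: n for app, n in passed_by_app.items() if n > 0}
    if not positive:
        return {}
    counts = Counter(leading_digit(n) for n in positive.values())
    total = len(positive)
    flagged = {}
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        if counts.get(d, 0) / total - expected > tolerance:
            flagged[d] = sorted(
                app for app, n in positive.items() if leading_digit(n) == d
            )
    return flagged


if __name__ == "__main__":
    # Invented counts: half of these applications report a value starting with 8.
    sample = {"app_a": 81, "app_b": 8, "app_c": 850,
              "app_d": 12, "app_e": 130, "app_f": 25}
    print(applications_behind_excess_digits(sample))
```

A flagged application is only a candidate for a closer look at its QC results, not proof that anything is wrong with its numbers.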
In large organizations, with a plethora of applications and a mammoth volume of data that keeps growing every minute, a dashboard or software tool based on Benford’s law is a good place to start when you must answer questions such as “How good are your data quality checks?” or “What is the quality of your data quality checks?”