Distinguished Engineer, CTO and Data-based Pathology @ IBM
March 03, 2022
Do data silos deserve their bad rap?
And now, for the rest of the story…
The issue associated with the dreaded data silo tends to be far less about the actual silo and more about the connection, or lack thereof. For the moment, imagine a diagram with two boxes, one on the left and one on the right, and a line connecting the two. The topic of data silos is really about what you can or cannot do with that line.
The Wikipedia entry on Information Silos states:
Information silos occur whenever a data system is incompatible or not integrated with other data systems. This incompatibility may occur in the technical architecture, in the application architecture, or in the data architecture of any data system.
In a recent McKinsey report on the data-driven enterprise of 2025, data silos are blamed for contributing to organizational difficulty:
Data sets are also stored—sometimes in duplication—across sprawling, siloed, and often costly environments, making it difficult for users within an organization (such as data scientists looking for data to build analytics models) to quickly find, access, and integrate the data they need.
But let’s start by taking a step back and examining the nature of the data set from the McKinsey quote. First, the extreme opposite end of the spectrum from having “data sets” (plural) is having just one: a single data set.
Let’s hypothesize a scenario. Suppose we could overcome all aspects of physics and create a single data set (or database, or data store, etc.) that could hold any amount of data, address the most complex of queries, and return any size result in zero seconds. Would we create solutions based on a single data set paradigm?
After all, performing any workload in zero seconds from a single data set should undoubtedly help eradicate the data silo problem, would it not?
Perhaps, but other problems would likely creep in and cause further complications. For example, adding security policies to a single data set that could accommodate any and all enterprise data needs (from human resources to operations, etc.) would likely yield a significant vulnerability. Between the principles of zero trust and least privilege, and the mechanics of access control lists, role-based access controls, attribute-based access controls, and policy-based access controls, you just know that over time, somewhere, something will get overlooked!
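To make that "something will get overlooked" concrete, here is a minimal sketch of what layering those models on one shared data set can look like. All names (`check_acl`, `check_rbac`, `check_abac`, the `payroll.csv` resource, the `hr_admin` role) are hypothetical, invented purely for illustration:

```python
# Hypothetical sketch: layering multiple access-control models on one
# shared data set. Every name here is illustrative, not a real API.

ACL = {"payroll.csv": {"alice"}}              # access control list
ROLES = {"bob": {"hr_admin"}}                 # role assignments (RBAC)
ROLE_GRANTS = {"hr_admin": {"payroll.csv"}}   # what each role may read

def check_acl(user, resource):
    return user in ACL.get(resource, set())

def check_rbac(user, resource):
    return any(resource in ROLE_GRANTS.get(role, set())
               for role in ROLES.get(user, set()))

def check_abac(user, resource, attrs):
    # Attribute rule added later: contractors may never see payroll data.
    return not (attrs.get("contractor") and resource == "payroll.csv")

def allowed(user, resource, attrs):
    # Combining models: any one grant suffices unless ABAC explicitly denies.
    grant = check_acl(user, resource) or check_rbac(user, resource)
    return grant and check_abac(user, resource, attrs)

# A contractor granted the hr_admin role sails through the ACL and RBAC
# layers; only the ABAC rule blocks them, and only because someone
# remembered to write that rule for this one resource.
ROLES["carol"] = {"hr_admin"}
print(allowed("carol", "payroll.csv", {"contractor": True}))   # False
print(allowed("carol", "payroll.csv", {}))                     # True
```

The point of the sketch is the interaction, not any single model: each layer is simple on its own, but as the single data set absorbs more of the enterprise, the number of layer-to-layer edge cases grows, and one of them eventually goes unhandled.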
Furthermore, if a change needed to be made to the data set, hundreds or even thousands of production applications could be simultaneously impacted. So, even if a single data set were physically possible, it could prove to be nearly impossible to secure and manage.
If having one data set is not the answer, what is the correct number? Two, three, four, forty, four hundred? What then?
First, we need to understand why we have to endure multiple data sets. That part is relatively easy to answer! It’s physics. In our physical world, we can’t rely on returning any size data set, from the most complex of queries, against any volume of data in zero seconds. Our choice of technology (from hardware to the database), location (cloud, on-prem, etc., and including network connectivity), and schema design will all serve to address the fundamental issue we all face: physics.
Next, data sets that are geared towards serving a specific business need will generally offer a higher degree of flexibility and provide for an easier time with maintenance (relatively speaking). Specialized content within a data set ultimately contributes to simplifying data sharing. Security is yet another consideration. The actual number of data sets needed is variable and is never going to be set in stone. The final number is a culmination of variables that must also include corporate culture and individual experience. Finally, a data architecture is one of the means by which we can address the best practice of having more than one data set.
A data silo is undoubtedly a double-edged sword. On the one hand, it’s a best practice to establish the silo, but to do so without consideration for interoperability and interchange is the other side: the anti-practice.
In the end, it’s not the data set part that plagues the data silo. It’s the “line.”
Ultimately, our choices for technology, location, schema, management (including governance), and security should center on how to manage the lines. How to add a line, how to deprecate a line, how to modify a line. The principal topic for opening a dialog on data silos should be “line management.” How are we going to manage the line? Knowing what can be done with your lines should be an architectural tollgate. The “line” should bear the bad rap, not the silo.
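One way to picture "line management" is to treat each line as a managed, versioned contract between two data sets rather than an ad-hoc connection. The sketch below is a minimal illustration of that idea; the `Line` and `LineRegistry` names, the data set names, and the version strings are all hypothetical:

```python
# Hypothetical sketch of "line management": each connection between two
# data sets is a first-class, versioned object that can be added,
# modified, or deprecated deliberately. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Line:
    source: str          # producing data set
    target: str          # consuming data set
    schema_version: str  # the contract both sides have agreed on
    deprecated: bool = False

class LineRegistry:
    def __init__(self):
        self._lines = {}

    def add(self, line):
        """Adding a line is an explicit, recorded act."""
        self._lines[(line.source, line.target)] = line

    def modify(self, source, target, new_version):
        """Modifying a line means renegotiating its contract version."""
        self._lines[(source, target)].schema_version = new_version

    def deprecate(self, source, target):
        """Deprecating a line retires it without silently deleting it."""
        self._lines[(source, target)].deprecated = True

    def active(self):
        return [line for line in self._lines.values()
                if not line.deprecated]

registry = LineRegistry()
registry.add(Line("hr_db", "analytics", "v1"))
registry.add(Line("ops_db", "analytics", "v1"))
registry.modify("ops_db", "analytics", "v2")
registry.deprecate("hr_db", "analytics")
print([line.source for line in registry.active()])   # ['ops_db']
```

The design choice the sketch reflects is the article's tollgate: before a new line is drawn between two silos, it must pass through the registry, so the organization always knows what its lines are, what contract each one carries, and which ones are on their way out.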