How to build the right Data Infrastructure to ensure GDPR compliance?
Let me start by saying that when it comes to building a Data Infrastructure that complies perfectly with GDPR, there is no magic recipe or one-size-fits-all approach. However, certain best practices and tools can strengthen overall governance.
In no particular order, the tools below can be used in conjunction with each other, or to complement an existing infrastructure, in order to achieve data protection and GDPR compliance. Perhaps the key thing to keep in mind when building your data infrastructure is this: you need to be able to easily access ALL the PII within your organisation, tag it, and segregate it into a separate table so that it is easy to scope and delete if necessary.
Data Lineage
Data Lineage allows organisations to trace the movement of data from its source to its point of use, providing visibility into all the ways it changes along the way. It is already widely used in heavily regulated industries such as banking, insurance and healthcare to maintain data-related regulatory compliance. By visualising and mapping metadata, Data Lineage supports the “right to be forgotten” and the “right to be informed” pillars in particular: it shows which tables PII resides in, and presents the full lineage and interdependencies from a specific BI report back through the tables and ETL processes. This makes it possible to 1. keep a record of how the business uses the data, and 2. assess the impact of a row deletion and ensure it is orchestrated in a GDPR-compliant manner.
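To make the idea concrete, the impact assessment above can be sketched as an upstream walk over a lineage graph. This is a minimal illustration with an invented schema (the asset names, the `lineage` dictionary and the `upstream_pii` helper are all hypothetical, not the API of any particular lineage tool):

```python
from collections import deque

# Lineage metadata: each downstream asset maps to the upstream
# assets it reads from (hypothetical example data).
lineage = {
    "revenue_report": ["fact_orders", "dim_customer"],
    "fact_orders": ["raw_orders"],
    "dim_customer": ["raw_customers"],
}
# Tables the metadata catalogue has tagged as containing PII.
pii_tables = {"raw_customers", "dim_customer"}

def upstream_pii(asset: str) -> set[str]:
    """Walk the lineage graph upstream and return every PII table
    the given asset ultimately depends on."""
    seen, queue, hits = set(), deque([asset]), set()
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
                if parent in pii_tables:
                    hits.add(parent)
    return hits

print(upstream_pii("revenue_report"))
# {'dim_customer', 'raw_customers'}
```

Before deleting a customer row, the same traversal run in the downstream direction would tell you which reports and ETL jobs are affected.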

Data Catalog or a Metadata Discovery tool
A Data Catalog is a detailed inventory of all data assets in an organization and their metadata, designed to help data professionals quickly find the most appropriate data for any analytical business purpose, by facilitating the access and classification of data at scale.
Data warehouse architecture modules
Consent
When it comes to consent, GDPR requires explicit permission from the user to process their data. Practically speaking, each processing activity should have its own explicit “check box” in the UI. From a data-storage perspective, you should keep these consent flags in separate columns in the database. If the user unchecks a box on their profile, withdrawing consent, a fetching mechanism lets you identify the processes that rely on that user’s PII and exclude it.
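The column-per-activity layout can be sketched as follows. The row shapes, column names and the `eligible_for` helper are illustrative assumptions, not a prescribed schema:

```python
# Hypothetical user rows: one boolean column per processing activity,
# mirroring the per-activity consent checkboxes in the UI.
users = [
    {"user_id": 1, "consent_marketing": True,  "consent_analytics": True},
    {"user_id": 2, "consent_marketing": False, "consent_analytics": True},
    {"user_id": 3, "consent_marketing": True,  "consent_analytics": False},
]

def eligible_for(activity: str) -> list[int]:
    """Return the user ids whose consent flag for this activity is
    still set; anyone who has unchecked the box is excluded."""
    return [u["user_id"] for u in users if u.get(f"consent_{activity}")]

print(eligible_for("marketing"))  # [1, 3]
```

Because each activity has its own column, withdrawing consent for marketing does not affect analytics processing, and every downstream job can filter on exactly the flag it depends on.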
Pseudonymisation engine
As defined in Article 4 of GDPR, pseudonymisation is a data-management and de-identification process by which personally identifiable information (PII) can no longer be attributed to a specific data subject without the use of additional information. When implemented properly, pseudonymisation ensures a certain level of protection during the processing of personal data. Pseudonymised data is not completely exempt from data-privacy requirements, since re-identification remains possible; even so, if done properly and consistently across all data processes, it can prove advantageous for comprehensive data analysis, particularly in the data-warehouse and analytics domain.
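One common way to build such an engine is keyed hashing: a direct identifier is replaced by a deterministic token, and the key plays the role of the “additional information” held separately under access control. This is a minimal sketch of that technique, not the only valid approach; the key value shown is a placeholder:

```python
import hmac
import hashlib

# The secret key is the "additional information" of Article 4: it must
# live in a vault, separate from the pseudonymised data. Placeholder only.
SECRET_KEY = b"store-me-in-a-vault-not-in-code"

def pseudonymise(pii_value: str) -> str:
    """Replace a direct identifier with a keyed hash. The same input
    always yields the same token, so joins and aggregations across the
    warehouse still work, while the raw identifier never appears."""
    return hmac.new(SECRET_KEY, pii_value.encode(), hashlib.sha256).hexdigest()

token = pseudonymise("alice@example.com")
# Deterministic: analytics can still join on the token.
assert token == pseudonymise("alice@example.com")
# Distinct subjects get distinct tokens.
assert token != pseudonymise("bob@example.com")
```

Consistency matters here: if one ETL process hashes emails and another stores them raw, the segregation the article describes breaks down, which is why the same engine should be applied across all data processes.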
Deletion engine
A good practice when it comes to carrying out PII deletions is to conduct them in batches. The best way to incorporate this is to flag and date PII as part of the data-management process, then conduct batch deletions once a month, within the 30-day window stated by GDPR.
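The flag-and-date pattern can be sketched like this, assuming an invented row layout with a `delete_requested` column and a hypothetical monthly `batch_delete` job that also checks no request has slipped past the 30-day window:

```python
from datetime import date, timedelta

# Illustrative rows: an erasure request flags the row with the date
# the data subject asked to be forgotten.
rows = [
    {"user_id": 1, "delete_requested": date(2024, 1, 3)},
    {"user_id": 2, "delete_requested": None},
    {"user_id": 3, "delete_requested": date(2024, 1, 20)},
]

def batch_delete(rows, run_date, window_days=30):
    """Monthly batch job: drop every flagged row, and fail loudly if
    any request is already past the GDPR 30-day window."""
    overdue = [r for r in rows
               if r["delete_requested"] is not None
               and run_date - r["delete_requested"] > timedelta(days=window_days)]
    if overdue:
        raise RuntimeError(f"requests past the 30-day window: {overdue}")
    return [r for r in rows if r["delete_requested"] is None]

remaining = batch_delete(rows, run_date=date(2024, 1, 28))
print([r["user_id"] for r in remaining])  # [2]
```

Running the batch monthly keeps every request inside the window provided the schedule never slips, which is exactly what the overdue check guards against.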
Information reports and tracking
By leveraging your metadata-management and lineage engines, you should be able to automate the identification and localisation of personal data within your data warehouse and storage systems, and query the metadata table to generate reports for individuals who request them.
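Generating such a report then reduces to a query over the metadata table. The catalogue entries and the `access_report` helper below are hypothetical, sketching only the shape of the lookup:

```python
# Hypothetical metadata catalogue, populated by the lineage and
# metadata-discovery engines: where each category of personal data lives.
pii_locations = [
    {"category": "email",   "system": "warehouse", "table": "dim_customer"},
    {"category": "address", "system": "warehouse", "table": "dim_customer"},
    {"category": "email",   "system": "crm",       "table": "contacts"},
]

def access_report(categories: set[str]) -> dict[str, list[str]]:
    """Build a right-of-access report: for each requested category,
    list every system and table that holds it."""
    report: dict[str, list[str]] = {}
    for loc in pii_locations:
        if loc["category"] in categories:
            report.setdefault(loc["category"], []).append(
                f'{loc["system"]}.{loc["table"]}')
    return report

print(access_report({"email"}))
# {'email': ['warehouse.dim_customer', 'crm.contacts']}
```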
Conclusion
It goes without saying: data is every modern company’s greatest asset, and regulating data therefore indirectly shapes the economy and the way modern organisations conduct their business.
The introduction of the European Union’s General Data Protection Regulation (GDPR) in 2018 pioneered data-protection regulation by introducing a new set of rules for global companies operating in the EU. The California Consumer Privacy Act (CCPA) came into effect later, on the 1st of January 2020, while similar legislation has also been passed in countries such as China, New Zealand, Canada and South Africa.
Despite being an EU regulation, the GDPR has far-reaching implications. Since its enforcement on the 25th of May 2018, companies such as Amazon, Google, H&M and British Airways, to cite a few, have paid hefty fines for failing to comply with its guidelines.
Two main factors make it difficult to support GDPR in Big Data technologies: the immutable nature of storage, and the infeasibility of partitioning datasets by individual user.