GDPR, or the General Data Protection Regulation, is an EU regulation that went into effect on May 25th 2018. Since it was adopted in 2016, your company, like others that conduct business in Europe, has invested a lot of resources to achieve compliance. You probably hired a Data Protection Officer and ran several training sessions so that your staff understood the new rules. You put in place new processes to document and classify the data you hold, established consent procedures, conducted several information audits, and reviewed your data governance process. According to the FT, companies globally spent billions of dollars preparing for the enforcement of the regulation. And according to a PwC report, 88% of companies spend more than $1 million, and 40% spend more than $10 million, on maintaining GDPR compliance.
And yet.
Google, H&M, TIM, British Airways and Marriott, to cite a few, have paid hefty fines for failing to comply with the regulation. And unless you’ve been living under a rock for the past few weeks, you’ve probably heard of the record Amazon GDPR fine. On July 30th, news broke that Luxembourg’s National Commission for Data Protection (CNPD) had hit Amazon with a record-breaking €746 million ($887 million) GDPR fine over the way it uses customer data for targeted advertising. The fine is unprecedented: it is the biggest GDPR fine issued to date and is more than double the amount of all other GDPR fines combined.
So what is GDPR? What Engineering challenges does it bring? And how do you build the right Data Infrastructure to ensure GDPR compliance?
What is GDPR? I’ll keep it short.
GDPR is a regulation that requires businesses to protect the personal data and privacy of EU citizens for transactions that occur within EU member states. To that end, every company that provides a product or a service to EU citizens or organisations is required to comply with GDPR. It came into force on May 25th 2018 and aims to give every EU citizen the right to know and decide how their personal data is used, stored, protected, transferred and deleted.
What Engineering challenges does it bring?
To understand the challenges GDPR brings to your Data Infrastructure, let’s go over some of the pillars of the framework. Under GDPR, EU citizens are given a certain set of rights:
- The right to be forgotten
Users’ data must be completely eliminated upon their request. In the era of auto-scheduled backups, non-volatile storage systems, and all-pervasive caching, this represents a real engineering challenge.
- The right to data portability
Or right to request. This means that users have the right to retrieve all the information a company has collected from them in an exportable, universally readable format. In other words, another technical challenge to overcome.
- The right to object to processing
While keeping the data (upon user consent and for “necessary” business operations) is allowed, additional explicit user consent is still required to process the data. Think about how you are going to exclude certain records when writing SQL code…
- The right to rectification
This gives users the right to change their PII data in your system as they see fit. From an engineering perspective, this means your company will need to have a way of tagging and enumerating PII-related data.
- The right to be informed
Users need to be informed when their data is being collected and for what purposes. In case of a data breach, users need to be notified as soon as possible and data protection authorities informed within 72 hours.
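To make the data portability requirement from the list above a bit more concrete, here is a minimal sketch of an export routine. It assumes a hypothetical schema (`users`, `orders`) and a hypothetical `USER_TABLES` map, and it uses an in-memory SQLite database as a stand-in for your real stores. JSON is just one possible "universally readable" format.

```python
import json
import sqlite3

# Hypothetical map: table holding user data -> column that identifies the user.
USER_TABLES = {"users": "id", "orders": "user_id"}


def export_user_data(conn: sqlite3.Connection, user_id: int) -> str:
    """Collect every row belonging to user_id and return it as a JSON document."""
    export = {}
    for table, id_column in USER_TABLES.items():
        # Table and column names come from the fixed map above, never from user input.
        cursor = conn.execute(
            f"SELECT * FROM {table} WHERE {id_column} = ?", (user_id,)
        )
        columns = [description[0] for description in cursor.description]
        export[table] = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return json.dumps(export, indent=2, sort_keys=True)


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database to try the export on."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT);
        INSERT INTO users VALUES (1, 'ana@example.com'), (2, 'bob@example.com');
        INSERT INTO orders VALUES (10, 1, 'book'), (11, 2, 'lamp');
        """
    )
    return conn
```

Calling `export_user_data(demo_connection(), 1)` returns a JSON string with that user's rows grouped by table. In a real system the hard part is keeping the table map complete, not the export itself.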
Taking all this into account, and mindful of other broader engineering challenges GDPR might bring, let’s dive into the challenges Data Engineers are facing and how to resolve them.
Personal Data
Let’s start here. A key requirement for GDPR compliance is being able to locate, enumerate and access all user data classified as PII. You also need to think about PII security measures (encryption and fine-grained access control for PII), but that’s a whole other complex topic.
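One possible way to make PII locatable and enumerable is to keep a small machine-readable catalogue of tagged columns. The sketch below is only an illustration: `ColumnTag`, `CATALOG`, and the system, table and column names are all made up, and a real catalogue would more likely live in a metadata store than in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnTag:
    """Describes one column in one system; every name here is illustrative."""

    system: str
    table: str
    column: str
    is_pii: bool
    user_key: str  # column that links a row back to a user


# Hypothetical catalogue entries.
CATALOG = [
    ColumnTag("crm_db", "contacts", "email", True, "user_id"),
    ColumnTag("crm_db", "contacts", "plan", False, "user_id"),
    ColumnTag("warehouse", "events", "ip_address", True, "uid"),
]


def pii_locations(catalog):
    """Group PII columns by (system, table) so they can be enumerated."""
    locations = {}
    for tag in catalog:
        if tag.is_pii:
            locations.setdefault((tag.system, tag.table), []).append(tag.column)
    return locations
```

With a catalogue like this, deletion, export and rectification jobs can all iterate over the same list instead of each hard-coding where PII lives.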
The right to be forgotten
The idea here is to be able to remove ALL records related to a user, distributed across a number of databases, tables, and systems, should they make the request. Pseudonymization of data (GDPR articles 6, 25, 32, 89) could bring a solution to this, but we’ll dive into that later.
Bulk deletion from all storage systems, especially the places where the entries are used in aggregate metrics or have different identifiers, can be extremely hard to implement properly. Orchestrating a cascade-like deletion without assessing the impact on other related data assets and BI reports is a recipe for disaster. Data lineage is particularly important here; more on this later.
Deleting rows that were once allocated to a certain user vs replacing PII data with a “removed user” placeholder can present some storage challenges, especially in SQL-based databases.
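The trade-off between the two approaches can be sketched as follows. This is a toy example with a hypothetical `users`/`orders` schema on in-memory SQLite, not a recommendation for either option.

```python
import sqlite3

PLACEHOLDER = "removed user"


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory database with two users and three orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
        INSERT INTO users VALUES (1, 'Ana', 'ana@example.com'), (2, 'Bob', 'bob@example.com');
        INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
        """
    )
    return conn


def forget_by_delete(conn: sqlite3.Connection, user_id: int) -> None:
    """Option 1: remove the user's rows everywhere, children first."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM orders WHERE user_id = ?", (user_id,))
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))


def forget_by_placeholder(conn: sqlite3.Connection, user_id: int) -> None:
    """Option 2: keep the rows but overwrite the PII columns."""
    with conn:
        conn.execute(
            "UPDATE users SET name = ?, email = NULL WHERE id = ?",
            (PLACEHOLDER, user_id),
        )


def total_revenue(conn: sqlite3.Connection) -> float:
    """An aggregate metric that depends on the orders table."""
    return conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
```

Deleting changes the aggregate (the forgotten user's orders disappear from `total_revenue`), while the placeholder keeps the aggregate intact but leaves rows that are still linked to a user ID. Whether the remaining rows are acceptable depends on what else they contain, which is exactly why lineage matters.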
Backups can quickly go from being a routine procedure to becoming an absolute engineering nightmare. You are faced with either no backup of PII at all, or the pseudonymisation of the backup. If you choose pseudonymisation, you need a proper mechanism that matches User IDs to PII identifiers in a way that allows you to get rid of “forgotten” User IDs without having to go through the backup again to delete the users’ data. In addition, you should keep a table of forgotten user IDs in a separate database with a different backup/restore process, so that when you restore a backup you omit the forgotten users.
The right to object to processing
This is arguably the least challenging requirement, although preventing an automated system from processing the data it stores may seem like an arduous and almost illogical task. A few ideas here:
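Here is a minimal sketch of that mechanism, under a few assumptions: the backup stores only a random pseudonym per user instead of the user ID, the mapping lives outside the backup, and the set of forgotten pseudonyms is kept in its own store. `PseudonymMap`, `forget_user` and `restore_backup` are hypothetical names, and plain Python structures stand in for real storage.

```python
import secrets


class PseudonymMap:
    """user_id -> random pseudonym; assumed to live outside the backed-up data."""

    def __init__(self):
        self._by_user = {}

    def pseudonym_for(self, user_id: str) -> str:
        """Return the user's pseudonym, creating one on first use."""
        if user_id not in self._by_user:
            self._by_user[user_id] = secrets.token_hex(16)
        return self._by_user[user_id]

    def knows(self, user_id: str) -> bool:
        return user_id in self._by_user

    def drop(self, user_id: str):
        """Delete the mapping and return the pseudonym it pointed to, if any."""
        return self._by_user.pop(user_id, None)


def forget_user(user_id: str, mapping: PseudonymMap, forgotten: set) -> None:
    """Break the link to the backup and remember which pseudonym to skip."""
    pseudonym = mapping.drop(user_id)
    if pseudonym is not None:
        forgotten.add(pseudonym)


def restore_backup(backup_rows: list, forgotten: set) -> list:
    """Restore only the rows that do not belong to a forgotten pseudonym."""
    return [row for row in backup_rows if row["pseudonym"] not in forgotten]
```

The point of the design is that `forget_user` never touches the backup itself: the old backup rows can no longer be matched to a user ID through the mapping, and the restore step filters them out. It only holds if the backed-up rows carry no other identifying fields.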
Adding a table of “restricted users” and filtering the outputs against it. Easy and works for most modern databases.
Adding a boolean column / field to all tables and collections containing PII is another way of solving this. Less easy and requires some careful orchestrating.
A more sophisticated solution would be to add a button labelled “restrict processing” to your admin panel and to the user’s settings page; when clicked, it should mark the profile as restricted. That should create a “wall” preventing back-office staff from accessing the data and hence processing it.
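The first idea above, a table of restricted users that outputs are filtered against, could look roughly like this. The schema, the `processable_users` view and the helper names are assumptions for illustration, with in-memory SQLite standing in for your database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create users, a restricted_users table, and a view that hides restricted users."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE restricted_users (user_id INTEGER PRIMARY KEY);
        CREATE VIEW processable_users AS
            SELECT u.* FROM users u
            WHERE NOT EXISTS (
                SELECT 1 FROM restricted_users r WHERE r.user_id = u.id
            );
        INSERT INTO users VALUES (1, 'ana@example.com'), (2, 'bob@example.com');
        """
    )
    return conn


def restrict_processing(conn: sqlite3.Connection, user_id: int) -> None:
    """Mark a user as restricted; calling it twice is harmless."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO restricted_users (user_id) VALUES (?)",
            (user_id,),
        )


def processable_emails(conn: sqlite3.Connection) -> list:
    """Processing jobs read from the view, never from the users table directly."""
    rows = conn.execute("SELECT email FROM processable_users ORDER BY id")
    return [row[0] for row in rows]
```

Routing every processing query through a view means the exclusion logic is written once rather than repeated in each SQL statement. The data is still stored, which matches the distinction between keeping and processing described above.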
The right to be informed
Under GDPR, when collecting user data for a particular business use case, users will frequently have to understand and agree to whatever it is your organisation wants to use their data for. This can be a real nightmare if your organisation, like many, doesn’t fully understand the data it has, where it is being stored, or its specific business use case once it’s been stored. Often organisations “hoard” data hoping that the use cases will define themselves later. In order to ensure GDPR compliance, companies will need to understand exactly why they are collecting data and get into the habit of tagging that data at the time of collection.
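Tagging at the time of collection can be as simple as refusing to store a value that arrives without a stated purpose. The sketch below assumes consent is tracked as a set of purpose names per user; `CollectedValue`, `collect` and `usable_for` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class CollectedValue:
    """A stored value together with the purposes it was collected for."""

    user_id: str
    field_name: str
    value: str
    purposes: frozenset
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def collect(user_id, field_name, value, purposes, consented_purposes):
    """Store a value only if it has a purpose and the user agreed to each one."""
    purposes = frozenset(purposes)
    if not purposes:
        raise ValueError("a purpose is required at collection time")
    missing = purposes - set(consented_purposes)
    if missing:
        raise PermissionError(f"no consent for: {sorted(missing)}")
    return CollectedValue(user_id, field_name, value, purposes)


def usable_for(record: CollectedValue, purpose: str) -> bool:
    """Downstream jobs check the tag before using the value."""
    return purpose in record.purposes
```

For example, an email collected with `purposes={"billing"}` passes `usable_for(record, "billing")` but not `usable_for(record, "marketing")`, which makes “hoarding” data without a use case fail loudly at write time.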
The right to rectification
This might seem like an obvious rule, but it isn’t always diligently followed. A user should be able to access and edit all the personal data you’ve collected about them, including PII you fetched from third parties (Salesforce, Apple login, Facebook ID etc). As a general rule here, all personal data should be editable through the UI.
More to come on How to build the right Data Infrastructure to ensure GDPR compliance in Part 2!