Machine-learning algorithms need transparency to comply with GDPR

The European Union’s General Data Protection Regulation (GDPR), which becomes enforceable on May 25, 2018, redefines how organizations are required to handle the collection and use of EU citizens' personal data.

Debates around the GDPR focus mostly on the global reach of this legislation, the draconian fines it introduces, or its stricter rules for “informed consent” as a condition for processing personal data.

However, one challenge the GDPR brings to companies is often overlooked: the citizens’ right to explanation.

Legal details aside, the GDPR entitles citizens to sufficient information about the automated systems that process their personal data, so that they can make an informed decision about whether to opt out of such processing. (A legal analysis, comprehensive yet understandable to non-lawyers, can be found here.)

Awareness of the right itself is low. Moreover, it is not widely understood that this newly introduced privacy protection brings a significant business risk to companies that process citizens’ data.

Yes, other citizens’ rights introduced or expanded by the GDPR, like the right to object to profiling, the right to obtain a copy of personal data gathered, or the right to be forgotten, can all be costly to comply with. But many companies are finding themselves incapable of explaining the results of their personal data processing. Worse, they often simply can’t figure out how to comply with this GDPR-imposed obligation.

Our black box has decided

The problem is that systems that process citizens’ personal data often rely on machine learning. And, unlike standard “if-then” algorithms, machine-learning models are something of a “black box”: no one knows exactly what happens inside, or the exact reasoning behind the output.
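To make the contrast concrete, here is a minimal sketch of a standard “if-then” decision in Python. The function name, the applicant fields and the thresholds are all hypothetical, chosen only for illustration; the point is that every rule that fires can be reported back as a plain-language reason.

```python
# Hypothetical rule-based credit check: every threshold below is an
# illustrative assumption, not a real lending policy.

def rule_based_decision(applicant):
    """Return (decision, reasons) for a dict with 'income', 'debt' and 'missed_payments'."""
    reasons = []

    # Rule 1: debt must not exceed half of yearly income (assumed threshold).
    if applicant["debt"] > 0.5 * applicant["income"]:
        reasons.append("debt exceeds 50% of yearly income")

    # Rule 2: at most one missed payment is tolerated (assumed threshold).
    if applicant["missed_payments"] > 1:
        reasons.append("more than one missed payment on record")

    # The application is "bad" as soon as at least one rule has fired.
    decision = "bad" if reasons else "good"
    return decision, reasons


if __name__ == "__main__":
    decision, reasons = rule_based_decision(
        {"income": 40_000, "debt": 30_000, "missed_payments": 0}
    )
    print(decision, reasons)
```

Here the explanation comes for free, because the list of reasons is the decision logic itself. A machine-learning model offers no such list; its output arrives without the reasons attached.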

This is especially the case with methods relying on neural networks. Decision-tree-based machine-learning methods allow, in theory, for determining the learning path. In practice, however, severe constraints make any explanation extremely difficult.

Let’s look at an extremely simplified example. Imagine that a bank has a machine-learning system to determine the creditworthiness of those who apply for a loan. Based on data about previous loans – including their outcome, labeled as “good” or “bad” – the system learns on its own how to predict whether a new application would end up being a “good” or “bad” prospect for a loan.

The reasoning behind the prediction, which determines, for example, whether the applicant will be able to afford a house, lies in how a complex web of thousands of simulated neurons processes the data. The learning process consists of billions of steps and is difficult to trace backwards. Because of both technological constraints and fundamental limitations of the underlying mathematical theories, no one can really tell exactly why any particular sample of data was labeled as “bad”.
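A toy forward pass hints at why this is hard. The sketch below, in Python, uses a network with just two hidden neurons and made-up weights; every number and name in it is an assumption for illustration, whereas a real model would learn thousands or millions of weights from historical loan data. Even at this scale, the only “reason” the code can offer for its output is a list of numbers.

```python
import math

# Made-up weights for illustration only; a trained network would learn
# these from historical loan data.
HIDDEN_WEIGHTS = [[0.8, -1.2, 0.3], [-0.5, 0.9, -1.1]]  # 2 neurons x 3 inputs
HIDDEN_BIASES = [0.1, -0.2]
OUTPUT_WEIGHTS = [1.5, -0.7]
OUTPUT_BIAS = 0.05


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(features):
    """Return (score, hidden_activations) for a list of 3 scaled inputs."""
    # Each hidden neuron computes a weighted sum of all inputs, then a sigmoid.
    hidden = [
        sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
        for weights, bias in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)
    ]
    # The output neuron does the same over the hidden activations.
    score = sigmoid(sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS)
    return score, hidden


if __name__ == "__main__":
    # Hypothetical applicant: income, debt and missed payments scaled to 0..1.
    score, hidden = predict([0.4, 0.75, 0.0])
    label = "good" if score >= 0.5 else "bad"
    print(label, round(score, 3), [round(h, 3) for h in hidden])
```

The hidden activations are the model's entire “reasoning”, yet none of them maps onto a statement such as “debt too high”. Scaling this up to thousands of neurons only widens the gap between the numbers and anything a customer could be told.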

Between a rock and a hard place

Machine learning has become a method of choice for processing large datasets and sorting samples into groups. For this reason, the right to explanation poses a fundamental challenge – and a risk of non-compliance – for all those dealing with piles of personal data of European citizens.

Unless companies processing citizens’ personal data fully understand the reasoning behind the decisions made based on their machine-learning models, they will find themselves between a rock and a hard place. To save costs and keep the business running, they must prevent their customers from opting out of automated processing of their personal data. At the same time, to avoid the huge fines the GDPR imposes for non-compliance, they must preserve the illusion that the company really respects the customer's right to a standard explanation, plus the right to a human review should a result be contested.

Basic research is needed

To be able to explain the reasoning behind their automated decision-making processes, and thus grant the right to explanation to their customers, companies must wait for radical improvements in the understanding of how machines learn. Simply put, machine-learning processes must become transparent, or at least much less like a black box, before companies that fall under the GDPR can become compliant.

However, transparency of machine learning is a tricky beast: unpredictability (non-transparency, if you will) is rooted deep in the foundational mathematical theories that machine learning is based on. For this reason, solving the right-to-explanation problem requires improving the theoretical foundations of machine learning.

Machine-learning scientists are already shifting their focus this way; however, it might take years before we see any GDPR-applicable results.

Transparency: a need or a threat?

Unlike marketers and others who process personal data en masse and must be compliant with privacy regulations, cybersecurity companies do not welcome such a shift in machine-learning research.

More resources allocated to understanding the models (i.e., for the sake of transparency) means fewer resources devoted to making the models more accurate and effective.

For us malware hunters, having accurate and effective machine-learning models is paramount, while transparency of those models is the very last thing we need. After all, we don’t want to see cybercriminals successfully fine-tuning their malicious code to sneak past our protections, do we?

However, we must be prepared for our adversaries upping their game based on a better understanding of how our machine-learning models work.

Undoubtedly, it’s important to improve our machine-learning models and make them more sophisticated and thus harder to bypass. However, the most important measure in this regard is to have more layers of protection.

The advent of tools for uncloaking machine-learning models clearly shows how fragile protections that rely purely on these models can be. In my opinion, testing organizations should develop more sophisticated ways to test how well security solutions withstand attempts to bypass their detection mechanisms using knowledge of how those mechanisms work. These advanced tests are needed to distinguish solutions that are reliable and hard to bypass from those that work only under ideal conditions.

About the Writer: Juraj Jánošík, Automated Threat Detection and Artificial Intelligence Team Lead, ESET.