Machine learning (ML) is routinely cited by post-truth vendors as their biggest selling point and main advantage. But ML – even when it is done properly – comes with problems and limitations.

ESET has spent years perfecting automated detections, our name for ML in the cybersecurity context. Here are some of the biggest challenges we have observed and overcome in the course of implementing this technology in our business and home solutions.

First, to use machine learning you need a lot of inputs, every one of which must be correctly labeled. In a cybersecurity application this translates into a huge number of samples, divided into two groups – malicious and clean. We’ve spent almost three decades gathering data to train our ML system.
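To make the idea of labeled training data concrete, here is a minimal sketch of supervised classification in Python using scikit-learn. The feature vectors, sample counts and model choice are purely illustrative assumptions standing in for real clean and malicious samples – this is not a description of our actual pipeline.

```python
# Minimal sketch: training a classifier on labeled "clean" vs. "malicious" samples.
# All feature values here are synthetic placeholders; a real pipeline would
# extract traits from actual files (size, entropy, imported APIs, etc.).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical feature vectors: rows are samples, columns are extracted traits.
clean = rng.normal(loc=0.0, scale=1.0, size=(1000, 8))
malicious = rng.normal(loc=1.5, scale=1.0, size=(1000, 8))

X = np.vstack([clean, malicious])
y = np.array([0] * len(clean) + [1] * len(malicious))  # 0 = clean, 1 = malicious

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print("held-out accuracy:", clf.score(X_test, y_test))
```

The quality of the result depends entirely on the labels: if a slice of the "clean" rows were in fact malicious, the classifier would faithfully learn the wrong boundary.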

Where would a recently formed post-truth vendor get such data? Unless it resorts to the unethical use of competitor research, there is no way to create a sufficiently large or reliable database.

Garbage in – garbage out

"Even when an ML algorithm has been fed a large quantity of data, there is still no guarantee that it can correctly identify all the new samples it encounters."

Even when an ML algorithm has been fed a large quantity of data, there is still no guarantee that it can correctly identify all the new samples it encounters. Human verification is therefore needed. Without this, even one incorrect input can lead to a snowball effect and possibly undermine the solution to the point of complete failure.

The same situation ensues if the algorithm uses its own outputs as inputs. Any error is then reinforced and multiplied: the incorrect result enters a loop and generates more “trash” – false positives or missed detections of malicious items – which then re-enters the system.
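As a rough illustration of this feedback loop – a toy simulation, not a model of any real product – the sketch below assumes each new batch of samples is labeled by the current model and fed straight back into the training set. The numbers and the simple error model are assumptions chosen only to show the compounding trend.

```python
# Toy simulation of a self-reinforcing labeling loop.
# Assumption: the next model's error grows with the noise in its training labels.
error_rate = 0.02          # initial share of wrong verdicts
label_noise = 0.0          # share of wrong labels accumulated in the training set
training_size = 100_000    # hypothetical starting corpus
new_batch = 20_000         # new samples labeled by the model each round

for round_no in range(1, 6):
    # The current model mislabels `error_rate` of the new batch.
    wrong_labels = error_rate * new_batch
    label_noise = (label_noise * training_size + wrong_labels) / (training_size + new_batch)
    training_size += new_batch
    # Simplifying assumption: retraining on noisier labels adds that noise
    # on top of the model's irreducible error.
    error_rate = 0.02 + label_noise
    print(f"round {round_no}: label noise {label_noise:.3%}, model error {error_rate:.3%}")
```

Even with these mild assumptions, each round makes the training data a little dirtier and the model a little worse – exactly the snowball effect that human verification is meant to prevent.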

Some post-truth security vendors claim that similar situations can’t happen with their machine learning algorithms, since they can identify every sample before it runs and determine whether it is clean or malicious just by doing the “math”.

However, the famous mathematician, cryptanalyst and computer scientist Alan Turing (who played a leading role in breaking the Nazi Enigma cipher during WW2 at Bletchley Park in England) proved that this isn’t possible. Even a flawless machine cannot always decide whether a future, unknown input will lead to unwanted behavior – in Turing’s case, behavior that would make the machine loop indefinitely.

Fred Cohen, the computer scientist who formulated the definition of a computer virus, went one step further and demonstrated that this so-called “halting problem” applies to cybersecurity as well. Determining whether a program will act maliciously just by examining it – without running it – is what he called an “undecidable problem”. The same difficulty arises for future inputs, or for specific settings that might push an otherwise benign program into malicious behavior.
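The core of Cohen’s argument can be sketched in a few lines of Python. The detector here is a hypothetical stand-in for any claimed “perfect” static classifier, and the behavior strings are illustrative only; the point is that a program built to consult the detector about itself and do the opposite defeats every such detector.

```python
# Toy rendering of Cohen's undecidability argument.
# `detector` is any candidate "perfect" static classifier (hypothetical);
# nothing here runs real malware.

def make_contrary(detector):
    """Build a program that consults the detector about itself
    and then does the opposite of the verdict."""
    def contrary():
        if detector(contrary):
            return "harmless"    # flagged as malicious -> behaves cleanly
        else:
            return "malicious"   # declared clean -> behaves maliciously
    return contrary

# Try two naive "perfect" detectors: one that flags everything,
# one that clears everything. Each is wrong about the contrary program.
for verdict in (True, False):
    detector = lambda prog, v=verdict: v
    contrary = make_contrary(detector)
    print(f"detector says malicious={verdict}, program actually behaves: {contrary()}")
```

Whatever the detector answers about the contrary program, the answer is wrong – so no universal, run-nothing classifier can exist.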

So how does this apply to the current state of cybersecurity? If a vendor claims its machine learning algorithm can label every sample prior to running it and decide whether it is clean or malicious, then it would have to preventively block a huge number of undecidable items – flooding company IT departments with false positives.

The other option would be less aggressive detection that produces fewer false positives. However, if machine learning were the only technology applied, detection rates would fall far short of the claimed "100%" silver-bullet efficiency.
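This trade-off can be shown with a toy example. The maliciousness scores and thresholds below are made-up assumptions rather than measurements from any product; the sketch only demonstrates that moving the decision threshold exchanges false positives for misses.

```python
# Synthetic illustration of the false-positive vs. miss-rate trade-off.
import numpy as np

rng = np.random.default_rng(1)
clean_scores = rng.normal(0.3, 0.15, 10_000)      # scores assigned to clean files
malicious_scores = rng.normal(0.7, 0.15, 10_000)  # scores assigned to malware

for threshold in (0.4, 0.5, 0.6):
    false_positives = np.mean(clean_scores >= threshold)  # clean files blocked
    misses = np.mean(malicious_scores < threshold)         # malware let through
    print(f"threshold {threshold:.1f}: "
          f"false positive rate {false_positives:.1%}, miss rate {misses:.1%}")
```

Raising the threshold quiets the false alarms but lets more malware through; lowering it does the reverse. A single score-and-threshold mechanism cannot escape that arithmetic on its own.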

The cybersecurity “game” can change at any point

This leads us to one of the most serious limits on the application of ML technology in cybersecurity – the intelligent adversary. Three decades of experience in the field have shown us that counteracting such an opponent, i.e. a human being, is a never-ending cat-and-mouse game. Every time we protect our clients from malware, attackers look for a way around our solutions. We upgrade our protection, they search for new loopholes, and so on.

The ever-changing nature of the cybersecurity environment makes it impossible to create a universal protective solution, unless we want to deny the existence of progress on both sides of the barricade – white and black hat. ESET believes that we have to adapt and respond to the evolving threat landscape that actually exists, not some static, imaginary equivalent.

"In cybersecurity, the attackers don’t play by any rules. What’s worse, they are able to change the entire playing field without warning."

You might argue that machines have gotten smarter to the point where they can defeat humans at their own game – such as Google’s algorithm AlphaGo – and you would be right. However, these algorithms have only a very narrow focus, and function in a setting with predictable rules. In cybersecurity, the attackers don’t play by any rules. What’s worse, they are able to change the entire playing field without warning.

To combat an opponent with this so-called general intelligence, a security solution would need to be built around an equally strong (or general) AI, one able to adapt to new environments and new challenges. Today’s weak (or narrowly focused) ML is simply not up to that task.

With a purely ML-based cybersecurity solution, it takes only one successful attack by malicious actors to leave your company’s endpoints unguarded against a whole army of cybercriminals. ESET’s solutions therefore feature more than just ML. We use multiple technologies – typically missing from post-truth vendors’ products – that keep crooks out by combining high detection rates with low false positive rates.

The whole series:

  1. Editorial: Fighting post-truth with reality in cybersecurity
  2. What is machine learning and artificial intelligence?
  3. Most frequent misconceptions about ML and AI
  4. Why ML-based security doesn’t scare intelligent adversaries
  5. Why one line of cyberdefense is not enough, even if it’s machine learning
  6. Chasing ghosts: The real costs of high false positive rates in cybersecurity
  7. How updates make your security solution stronger
  8. We know ML, we’ve been using it for over a decade

With contributions from Jakub Debski & Peter Kosinar