[Update: John Leyden's own article on the topic is now up here. (Actually, it's been there for a while: I've just been a bit busy!)]
The Register’s John Leyden drew my attention to research by Carbon Black, a company marketing a host-based intrusion prevention system, indicating that if an AV package hasn’t added detection for malware within six days of its first being detected by another company, the chances are it still won’t detect the sample 30 days later. Carbon Black reached this conclusion after tracking the detection rates of 43 products for 84 random malware samples on the VirusTotal website.
The Carbon Black study has methodological drawbacks which will affect its accuracy. (That doesn’t mean its conclusions are completely wrong in this case, but I’ll get to that in a minute.) In fact, the study acknowledges one of those drawbacks, but draws conclusions based on those statistics anyway.
“As was pointed out when we conducted this study the first time, individual AV results vary based on configuration. Also, we did not include any of VirusTotal’s new sandboxing results in the most recent study so the results, just like the previous study, are limited to static signatures.”
Actually, while that statement does encapsulate a real problem, it slightly overstates it. The study is only limited to static signatures in the case of products that only use sandboxing/active heuristics/behaviour analysis in realtime scanning. The advantage of emulation is that it isn’t restricted to on-access/realtime scanning. An on-demand scanner can analyse the behaviour of a program dynamically because it runs it in a safe (emulated) environment. (I’m assuming that R.M. Gerard is using ‘static signature’ in this second Carbon Black article in the sense of ‘restricted to static analysis’ rather than in the sense of an old-school signature based on a static search string rather than a more sophisticated algorithm.)
Still, equating VT statistics with a product’s detection can be seriously misleading. Julio Canto of Hispasec/VirusTotal and I have previously addressed the issue in a joint paper, though we were particularly concerned about the implications for quasi-testing, and Gerard avoids that particular trap. Both the paper and VT’s ‘about’ page quote a highly relevant article from Bernardo Quintero that summarizes the ‘product performance versus VT reporting’ problem succinctly:
There is, however, an issue that Gerard doesn’t mention. By using random samples from malc0de.com, the study assumes that all samples from that source are equally valid. In other words, that they’re not only valid and validated malware, but that they’re samples which ‘should’ be detected using default settings, or at least the settings used in the scanner versions used by VirusTotal. In fact, that ignores the significant class of threat that the AV industry usually calls Possibly Unwanted – stuff that may not be referred to as unequivocally malicious, but that the customer probably wouldn’t want on his/her system anyway. Most products make that an option rather than a default because it reduces their exposure to legalistic manoeuvres: a lot of AV industry resources are tied up in mischievous litigation (or threats of litigation) by people pushing undesirable software. Making detection of PUAs/PUPs/PUS an option pushes some of the responsibility back to the customer by forcing them to make a conscious configuration choice, but it also reduces an AV vendor’s own attack surface, which allows it to free up more resources for dealing with unequivocally malicious software. Well, that’s the hope…
Of course, it may be that the samples used in this case are all perfectly valid in that sense, and that there are no flaky samples (corruptions, FPs): we simply don’t have that information. But you can’t make that assumption on the basis of malc0de.com’s own assessment of its utility and capabilities:
“Initially malc0de.com was created to link domains that were serving the same executable. What I found out in a very short period of time is the binaries are updated so frequently that this becomes almost impossible. Storing the MD5 is still useful just not as useful as I originally thought. The only purpose malc0de.com is to store and keep track of domains that host malicious binaries.”
Still, moving on and assuming that the quality of the samples/validity of the hashes doesn’t affect Carbon Black’s two main hypotheses…
Does the hypothesis that 43 scanners are more ‘effective’ than a single scanner hold up to scrutiny? Well, I haven’t subjected it to statistical analysis myself, and obviously I’m not about to accept Carbon Black’s analysis uncritically and unequivocally. But it does make sense, subject to reservations about what ‘effective’ actually means. It certainly doesn’t mean loading 43 products onto a single system; I’m not sure it means acting immediately on a single company’s ahead-of-the-curve detection that turns out to be a false positive; and it doesn’t necessarily mean having detection for every one of the hundreds of thousands of unique binary samples that find their way into an AV lab on a daily basis. (By unique binary sample, I don’t mean a program totally discrete from all other malware: I mean a sample packaged so that it’s different to every other sample, not least in that it has a different hash value. The base code isn’t necessarily any different, but the use of packers and obfuscators may entail a different or modified detection algorithm.)
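For anyone who wants to see the ‘unique binary sample’ point in miniature, here is a rough Python sketch. Everything in it is made up for illustration: the payload bytes are a harmless placeholder, and `repackage`/`unpack` are a toy one-byte XOR wrapper of my own devising, nothing like a real packer. It only shows that the same base code, wrapped three different ways, yields three different hash values.

```python
import hashlib

# Hypothetical stand-in for a payload: harmless placeholder bytes, not malware.
BASE_CODE = b"\x90\x90PAYLOAD-BYTES\xcc"


def repackage(payload: bytes, key: int) -> bytes:
    """Toy 'packer' (an assumption for illustration): XOR each byte with a
    one-byte key and prepend the key so the wrapper can be reversed."""
    return bytes([key]) + bytes(b ^ key for b in payload)


def unpack(sample: bytes) -> bytes:
    """Reverse the toy packer and recover the original payload."""
    key, body = sample[0], sample[1:]
    return bytes(b ^ key for b in body)


# Three differently packaged samples built from the same base code.
samples = [repackage(BASE_CODE, key) for key in (0x11, 0x22, 0x33)]

# Each packaged sample has its own hash value...
hashes = {hashlib.sha256(s).hexdigest() for s in samples}
print(len(hashes))

# ...even though the underlying code is identical in every case.
print(all(unpack(s) == BASE_CODE for s in samples))
```

A hash-only lookup treats these as three unrelated files, which is why counting unique hashes tells you very little about how many distinct threats are actually out there.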
Back in my NHS days, I was told by someone with far more influence on NHS security policy than I ever had that it didn’t matter which AV product(s) the NHS used because they all detect much the same range of threats. (By all, I assume he was referring to the mainstream commercial products.) Well, yes and no. Disregarding the fact that malware sample glut has increased dramatically since those days, it’s reasonable to assume that any one mainstream product will have access to and detection for day zero samples that other products won’t have: someone has to see a specific malicious binary first. And because mainstream players (not only vendors but testers) have multiple channels for sharing information, hashes and binaries, detection of significant, active threats cascades through the industry. (In general, we don’t prioritize competitive advantage over the welfare of the community as a whole. Hopefully.) So detection of a high-prevalence or high-impact or high-profile threat (not all malware falls into all three categories, of course, but one is probably enough to ensure prompt sharing) can usually be seen, if we must use VT as a metric, to go from one or two to double figures in a very few hours.
If you follow Carbon Black’s suggestion of ‘leveraging’ VT reports, you can probably get ahead of the curve, though that isn’t a risk-free strategy. (Short term FPs, for instance.) However, what that wouldn’t get you is detection of “all malicious samples on day one!” Embarrassing though it may be for our marketing departments, Stuxnet and its siblings have proved pretty conclusively that the entire security industry can completely miss a significant threat for extended periods. It’s probably safe to assume that there are threats that are never detected by any product. That’s bad news because there’s a false expectation that all badware will eventually be detected by AV – “If you were able to run all AVs together your systems would have been 100% secure”. (Though I’d say myself that if anyone really thinks that any security solution will give them 100% protection against all malware, let alone all security threats, I’ll be happy to sell them Tower Bridge and I’ll even throw in a couple of cathedrals.) But it’s not as bad as it sounds because the assumption that all those binaries are equally significant is unfounded. Let’s take just a few scenarios:
If you compare these in terms of significance, you’re not just comparing apples and oranges. You’re comparing sheep, castor oil and ball bearings. The bottom line is this:
Let’s look at one more aspect of the CB study:
“Let’s assume that a single signature can detect 100 malware variants. If so, one would have to write 7,835 signatures per day just to handle the 783,561 malicious samples being reported. These signatures will accumulate over time, requiring an AV to check each newly created file against an ever-growing list of signatures, which dramatically slows a user’s machine down to a crawl.
As a result, AVs must keep their signatures small and relevant, perhaps needing to remove an old signature for each new one added. Although we can’t guarantee this to be the case, it’s certainly a valid hypothesis as to why certain AVs detected fewer of our samples on day 30 than on day 1.”
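The arithmetic in that quotation is easy enough to check, and it is worth seeing how much the result depends on the ‘100 variants per signature’ assumption. The Python sketch below uses the two figures Carbon Black quotes; the function name and the larger coverage values are my own arbitrary illustrations, not numbers from the study or from any vendor.

```python
SAMPLES_PER_DAY = 783_561      # daily sample figure quoted by Carbon Black
VARIANTS_PER_SIGNATURE = 100   # Carbon Black's stated assumption


def signatures_per_day(samples: int, variants_per_signature: int) -> float:
    """New signatures needed per day if each one covers a fixed number of
    variants (a simplification taken from the quoted passage)."""
    return samples / variants_per_signature


# The quoted scenario: roughly 7,835 signatures per day.
print(signatures_per_day(SAMPLES_PER_DAY, VARIANTS_PER_SIGNATURE))

# Arbitrary alternative coverage values, purely to show the sensitivity.
for coverage in (100, 1_000, 10_000):
    print(coverage, round(signatures_per_day(SAMPLES_PER_DAY, coverage), 2))
```

Each extra zero on the coverage assumption takes a zero off the daily total, so the conclusion is only as good as that one guessed number.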
Modern products are indeed highly generic in their approach to detections (signatures, if you insist, though that term is misleading). Actually, many detection algorithms are capable of detecting many more variants and subvariants than 100 (we’re talking lots of zeroes here, in some cases…). But we don’t just add a detection for each processed sample: we modify a detection as necessary – of course, a good heuristic will sometimes detect many unknown samples without needing immediate modification. And sometimes a highly generic detection will be superseded for certain samples by a more specific detection, as more information is gathered on that particular threat family. But I doubt if any mainstream vendor pulls a detection within 30 days of an initial detection just to make room for another detection. If a vendor stops detecting something, it’s likely to be something else entirely: recognition of a false positive, reclassification (for instance as Possibly Unwanted), or even a process error.
I think I feel a paper coming on. ;-)
David Harley CITP FBCS CISSP
ESET Senior Research Fellow