Malware Detection, Virus Total, and Carbon Dating

Carbon Dating and Malware Detection

Carbon Black assert that if an AV company doesn't detect malware within six days of its being flagged on Virus Total, it probably won't after a month. Is that as dangerous as it sounds?

[Update: John Leyden’s own article on the topic is now up here. (Actually, it’s been there for a while: I’ve just been a bit busy!)]

The Register’s John Leyden drew my attention to research by Carbon Black, a company marketing a host-based intrusion prevention system, indicating that if an AV package hasn’t added detection for malware within six days of its first being detected by another company, the chances are it still won’t detect the sample 30 days later. Carbon Black reached this conclusion after tracking the detection rates of 43 products for 84 random malware samples on the VirusTotal website.
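To make that claim concrete, here is a minimal sketch of the kind of calculation it implies, using made-up data. The `first_detected` table, the product and sample names, and the helper `late_detection_rate` are all hypothetical; nothing here reproduces Carbon Black's data or its actual method.

```python
# Hypothetical data: for each (product, sample) pair, the day on which the
# product first detected the sample, or None if it never did within 30 days.
first_detected = {
    ("av_a", "s1"): 1,
    ("av_a", "s2"): 4,
    ("av_a", "s3"): None,
    ("av_b", "s1"): 12,
    ("av_b", "s2"): None,
    ("av_b", "s3"): None,
    ("av_c", "s1"): 2,
    ("av_c", "s2"): 20,
    ("av_c", "s3"): None,
}


def late_detection_rate(table, cutoff=6, horizon=30):
    """Of the pairs with no detection by `cutoff`, return the fraction
    that gained detection by `horizon`."""
    missed = [day for day in table.values() if day is None or day > cutoff]
    if not missed:
        return 0.0
    caught_late = [day for day in missed if day is not None and day <= horizon]
    return len(caught_late) / len(missed)


rate = late_detection_rate(first_detected)
print(f"Missed at day 6 but detected by day 30: {rate:.0%}")
```

A low value would be consistent with the "six days or probably never" reading, but only for whatever samples and scanner configurations went into the table.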

The Carbon Black study has methodological drawbacks which will affect its accuracy. (That doesn’t mean its conclusions are completely wrong in this case, but I’ll get to that in a minute.) In fact, the study acknowledges one of those drawbacks, but draws conclusions based on those statistics anyway.

“As was pointed out when we conducted this study the first time, individual AV results vary based on configuration. Also, we did not include any of VirusTotal’s new sandboxing results in the most recent study so the results, just like the previous study, are limited to static signatures.”

Actually, while that statement does encapsulate a real problem, it slightly overstates it. The study is only limited to static signatures in the case of products that only use sandboxing/active heuristics/behaviour analysis in realtime scanning. The advantage of emulation is that it isn’t restricted to on-access/realtime scanning. An on-demand scanner can analyse the behaviour of a program dynamically because it runs it in a safe (emulated) environment. (I’m assuming that R.M. Gerard is using ‘static signature’ in this second Carbon Black article in the sense of ‘restricted to static analysis’ rather than in the sense of an old-school signature based on a static search string rather than a more sophisticated algorithm.)

Still, equating VT statistics with a product’s detection can be seriously misleading. Julio Canto of Hispasec/VirusTotal and I have previously addressed the issue in a joint paper, though we were particularly concerned about the implications for quasi-testing, and Gerard avoids that particular trap. Both the paper and VT’s ‘about’ page quote a highly relevant article from Bernardo Quintero that summarizes the ‘product performance versus VT reporting’ problem succinctly:

  • VirusTotal’s antivirus engines are commandline versions, so depending on the product, they will not behave exactly the same as the desktop versions: for instance, desktop solutions may use techniques based on behavioural analysis and count with personal firewalls that may decrease entry points and mitigate propagation, etc.
  • In VirusTotal desktop-oriented solutions coexist with perimeter-oriented solutions; heuristics in this latter group may be more aggressive and paranoid, since the impact of false positives is less visible in the perimeter. It is simply not fair to compare both groups.
  • Some of the solutions included in VirusTotal are parametrized (in coherence with the developer company’s desire) with a different heuristic/agressiveness level than the official end-user default configuration.

There is, however, an issue that Gerard doesn’t mention. By using random samples from, it assumes that all samples from that source are equally valid. In other words, that they’re not only valid and validated malware, but that they’re samples which ‘should’ be detected using default settings, or at least the settings used in the scanner versions used by VirusTotal. In fact, that ignores the significant class of threat that the AV industry usually calls Possibly Unwanted – stuff that may not be referred to as unequivocally malicious, but that the customer probably wouldn’t want on his/her system anyway. Most products make that an option rather than a default because it reduces their exposure to legalistic manoeuvres: a lot of AV industry resources are tied up in mischievous litigation (or threats of litigation) by people pushing undesirable software. Making detection of PUAs/PUPs/PUS an option pushes some of the responsibility back to the customer by forcing them to make a conscious configuration choice, but it reduces an AV vendor’s own attack surface, which allows it to free up more resources for dealing with unequivocally malicious software. Well, that’s the hope…

Of course, it may be that the samples used in this case are all perfectly valid in that sense, and that there are no flaky samples (corruptions, FPs): we simply don’t have that information. But you can’t make that assumption on the basis of‘s own assessment of its utility and capabilities:

“Initially was created to link domains that were serving the same executable. What I found out in a very short period of time is the binaries are updated so frequently that this becomes almost impossible. Storing the MD5 is still useful just not as useful as I originally thought. The only purpose is to store and keep track of domains that host malicious binaries.”

Still, moving on and assuming that the quality of the samples/validity of the hashes doesn’t affect Carbon Black’s two main hypotheses…

Does the hypothesis that 43 scanners are more ‘effective’ than a single scanner hold up to scrutiny? Well, I haven’t subjected it to statistical analysis myself, and obviously I’m not about to accept Carbon Black’s analysis uncritically and unequivocally. But it does make sense, subject to reservations about what ‘effective’ actually means. It certainly doesn’t mean loading 43 products onto a single system; I’m not sure it means acting immediately on a single company’s ahead-of-the-curve detection that turns out to be a false positive; and it doesn’t necessarily mean having detection for every one of the hundreds of thousands of unique binary samples that find their way into an AV lab on a daily basis. (By unique binary sample, I don’t mean a program totally discrete from all other malware: I mean a sample packaged so that it’s different to every other sample, not least in that it has a different hash value. The base code isn’t necessarily any different, but the use of packers and obfuscators may entail a different or modified detection algorithm.)
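The point about hash values can be shown with a small sketch. The byte strings below are hypothetical stand-ins, not real malware, and the junk padding is only a crude approximation of what packers and obfuscators actually do.

```python
import hashlib

# Stand-in for a program's base code (hypothetical bytes, not real malware).
base_code = b"identical payload logic"

# Two "packagings" of the same base code: here just different junk padding,
# a crude stand-in for the effect of a packer or obfuscator.
sample_a = b"\x00" * 4 + base_code
sample_b = b"\x01" * 4 + base_code

hash_a = hashlib.sha256(sample_a).hexdigest()
hash_b = hashlib.sha256(sample_b).hexdigest()

print(hash_a)
print(hash_b)
print("same hash:", hash_a == hash_b)
print("same base code:", base_code in sample_a and base_code in sample_b)
```

The two samples count as distinct "unique binaries" by hash even though the underlying code is unchanged, which is why a raw count of samples says little about the number of genuinely different threats.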

Back in my NHS days, I was told by someone with far more influence on NHS security policy than I ever had that it didn’t matter which AV product(s) the NHS used because they all detect much the same range of threats. (By all, I assume he was referring to the mainstream commercial products.) Well, yes and no. Disregarding the fact that malware sample glut has increased dramatically since those days, it’s reasonable to assume that any one mainstream product will have access to and detection for day zero samples that other products won’t have: someone has to see a specific malicious binary first. And because mainstream players (not only vendors but testers) have multiple channels for sharing information, hashes and binaries, detection of significant, active threats cascades through the industry. (In general, we don’t prioritize competitive advantage over the welfare of the community as a whole. Hopefully.) So detection of a high-prevalence or high-impact or high-profile threat (not all malware falls into all three categories, of course, but one is probably enough to ensure prompt sharing) can usually be seen, if we must use VT as a metric, to go from one or two to double figures in a very few hours.

If you follow Carbon Black’s suggestion of ‘leveraging’ VT reports, you can probably get ahead of the curve, though that isn’t a risk-free strategy. (Short term FPs, for instance.) However, what that wouldn’t get you is detection of “all malicious samples on day one!” Embarrassing though it may be for our marketing departments, Stuxnet and its siblings have proved pretty conclusively that the entire security industry can completely miss a significant threat for extended periods. It’s probably safe to assume that there are threats that are never detected by any product. That’s bad news because there’s a false expectation that all badware will eventually be detected by AV – “If you were able to run all AVs together your systems would have been 100% secure”. (Though I’d say myself that if anyone really thinks that any security solution will give them 100% protection against all malware, let alone all security threats, I’ll be happy to sell them Tower Bridge and I’ll even throw in a couple of cathedrals.) But it’s not as bad as it sounds because the assumption that all those binaries are equally significant is unfounded. Let’s take just a few scenarios:

  1. Email-borne malware that spreads very far, very fast. (Even then, its significance is partly dependent on its payload).
  2. A highly targeted threat that only ever appears on one site.
  3. A threat that spreads with a degree of promiscuity, but only triggers when it finds itself on one particular site (or in an environment with very specific characteristics…)
  4. A drive-by that serves a different morph every few minutes, or even every time it triggers. So there’ll be instances where it isn’t served at all, or served to a platform where it isn’t able to execute for one of many reasons (attempts a patched exploit, inappropriate OS or OS version, code bugs etc).
  5. Malware that spreads far and wide but can’t trigger (intendeds, corruptions)

If you compare these in terms of significance, you’re not just comparing apples and oranges. You’re comparing sheep, castor oil and ball bearings. The bottom line is this:

  • No security software will detect (or protect from) every malicious binary
  • In an indeterminate number of cases, it doesn’t matter, since the malware doesn’t infect anything, or doesn’t or can’t trigger, or has no perceptible impact. There will be instances where undetected malware does have a significant impact – if it only happens to one system in the whole world, it’s still significant to the owner of that system – but the impact on the world as a whole is probably much less than the study suggests.
  • AV software is no more or less effective than it was before the CB studies. If people want to add Carbon Black to their armoury fair enough (that isn’t an endorsement: I haven’t looked at that service, only at the CB analyses) – it may provide an extra layer of defence, though I wouldn’t advise using it as a substitute for AV (and nor, I think, would VT).

Let’s look at one more aspect of the CB study:

“Let’s assume that a single signature can detect 100 malware variants. If so, one would have to write 7,835 signatures per day just to handle the 783,561 malicious samples being reported. These signatures will accumulate over time, requiring an AV to check each newly created file against an ever-growing list of signatures, which dramatically slows a user’s machine down to a crawl.

As a result, AVs must keep their signatures small and relevant, perhaps needing to remove an old signature for each new one added. Although we can’t guarantee this to be the case, it’s certainly a valid hypothesis as to why certain AVs detected fewer of our samples on day 30 than on day 1.”

Modern products are indeed highly generic in their approach to detections (signatures, if you insist, though that term is misleading). Actually, many detection algorithms are capable of detecting many more variants and subvariants than 100 (we’re talking lots of zeroes here, in some cases…) But we don’t just add a detection for each processed sample, we modify a detection as necessary – of course, a good heuristic will sometimes detect many unknown samples without needing immediate modification. And sometimes a highly generic detection will be superseded for certain samples by a more specific detection, as more information is gathered on that particular threat family. But I doubt if any mainstream vendor pulls a detection within 30 days of an initial detection just to make room for another detection. If a vendor stops detecting something, it’s likely to be something else entirely: recognition of a false positive, reclassification (for instance as Possibly Unwanted), or even a process error.
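For what it’s worth, the arithmetic in the quoted passage is very sensitive to its starting assumption. The sketch below reuses the quoted figure of 783,561 samples per day and keeps the quote’s simplifying model, in which every detection covers a fixed number of variants and a new detection is written for each batch; that model is the quote’s assumption, not a description of how vendors actually work, and the helper name `detections_needed` is made up for illustration. Rounding up, 100 variants per detection gives 7,836 rather than the quoted 7,835.

```python
import math

SAMPLES_PER_DAY = 783_561  # figure quoted in the Carbon Black article


def detections_needed(samples, variants_per_detection):
    """Detections needed if each one covers a fixed number of variants."""
    return math.ceil(samples / variants_per_detection)


# Coverage values beyond 100 are illustrative assumptions only.
for coverage in (100, 1_000, 10_000, 100_000):
    needed = detections_needed(SAMPLES_PER_DAY, coverage)
    print(f"{coverage:>7} variants per detection -> {needed:>5} detections per day")
```

Each extra zero on the coverage figure cuts the required number of new detections by a factor of ten, and that is before allowing for detections that are modified rather than added.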

I think I feel a paper coming on. ;-)

ESET Senior Research Fellow