Carbon Dating and Malware Detection

[Update: John Leyden’s own article on the topic is now up here. (Actually, it’s been there for a while: I’ve just been a bit busy!)]

The Register’s John Leyden drew my attention to research by Carbon Black, a company marketing a host-based intrusion prevention system, indicating that if an AV package hasn’t added detection for malware within six days of its first being detected by another company, the chances are it still won’t detect the sample 30 days later. Carbon Black reached this conclusion after tracking the detection rates of 43 products for 84 random malware samples on the VirusTotal website.
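
To make that headline claim concrete, here is a minimal sketch of the kind of metric being described. This is not Carbon Black’s methodology or code, and the data is entirely hypothetical; it just computes, from per-product first-detection days, what fraction of the samples still missed at day six remain missed at day 30.

```python
# Illustrative sketch only (hypothetical data, not Carbon Black's code):
# given the day on which each product first detected each sample
# (None = never detected within the observation window), estimate the
# chance that a sample still missed at the cutoff is also missed at the
# horizon.

def miss_persistence(first_detected, cutoff=6, horizon=30):
    """first_detected maps (product, sample) -> day of first detection,
    or None if the sample was never detected in the window."""
    missed_at_cutoff = [day for day in first_detected.values()
                        if day is None or day > cutoff]
    if not missed_at_cutoff:
        return 0.0
    still_missed = [day for day in missed_at_cutoff
                    if day is None or day > horizon]
    return len(still_missed) / len(missed_at_cutoff)

# Hypothetical timeline: product A detects sample s1 on day 2 but never
# detects s2; product B first detects s2 on day 12.
timeline = {("A", "s1"): 2, ("A", "s2"): None, ("B", "s2"): 12}
print(miss_persistence(timeline))  # 1 of 2 day-6 misses persists -> 0.5
```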

The Carbon Black study has methodological drawbacks which will affect its accuracy. (That doesn’t mean its conclusions are completely wrong in this case, but I’ll get to that in a minute.) In fact, the study acknowledges one of those drawbacks, but draws conclusions based on those statistics anyway:

“As was pointed out when we conducted this study the first time, individual AV results vary based on configuration. Also, we did not include any of VirusTotal’s new sandboxing results in the most recent study so the results, just like the previous study, are limited to static signatures.”

Actually, while that statement does encapsulate a real problem, it slightly overstates it. The study is only limited to static signatures in the case of products that only use sandboxing/active heuristics/behaviour analysis in realtime scanning. The advantage of emulation is that it isn’t restricted to on-access/realtime scanning. An on-demand scanner can analyse the behaviour of a program dynamically because it runs it in a safe (emulated) environment. (I’m assuming that R.M. Gerard is using ‘static signature’ in this second Carbon Black article in the sense of ‘restricted to static analysis’ rather than in the sense of an old-school signature based on a static search string rather than a more sophisticated algorithm.)

Still, equating VT statistics with a product’s detection capability can be seriously misleading, and Julio Canto of Hispasec/VirusTotal and I have previously addressed the issue in a joint paper, though we were particularly concerned there with the implications for quasi-testing, and Gerard avoids that particular trap. Both the paper and VT’s ‘about’ page quote a highly relevant article by Bernardo Quintero that summarizes the ‘product performance versus VT reporting’ problem succinctly:

  • VirusTotal’s antivirus engines are commandline versions, so depending on the product, they will not behave exactly the same as the desktop versions: for instance, desktop solutions may use techniques based on behavioural analysis and count with personal firewalls that may decrease entry points and mitigate propagation, etc.
  • In VirusTotal desktop-oriented solutions coexist with perimeter-oriented solutions; heuristics in this latter group may be more aggressive and paranoid, since the impact of false positives is less visible in the perimeter. It is simply not fair to compare both groups.
  • Some of the solutions included in VirusTotal are parametrized (in coherence with the developer company’s desire) with a different heuristic/aggressiveness level than the official end-user default configuration.

There is, however, an issue that Gerard doesn’t mention. By using random samples from its chosen source, the study assumes that all samples from that source are equally valid. In other words, that they’re not only valid and validated malware, but that they’re samples which ‘should’ be detected using default settings, or at least the settings used in the scanner versions used by VirusTotal. In fact, that ignores the significant class of threat that the AV industry usually calls Possibly Unwanted – stuff that may not be unequivocally malicious, but that the customer probably wouldn’t want on his/her system anyway. Most products make that detection an option rather than a default because it reduces their exposure to legalistic manoeuvres: a lot of AV industry resources are tied up in mischievous litigation (or threats of litigation) by people pushing undesirable software. Making detection of PUAs/PUPs/PUS an option pushes some of the responsibility back to the customer by forcing them to make a conscious configuration choice, but it reduces an AV vendor’s own attack surface, which allows it to free up more resources for dealing with unequivocally malicious software. Well, that’s the hope…

Of course, it may be that the samples used in this case are all perfectly valid in that sense, and that there are no flaky samples (corruptions, FPs): we simply don’t have that information. But you can’t make that assumption on the basis of the source’s own assessment of its utility and capabilities:

“Initially [the site] was created to link domains that were serving the same executable. What I found out in a very short period of time is the binaries are updated so frequently that this becomes almost impossible. Storing the MD5 is still useful, just not as useful as I originally thought. The only purpose is to store and keep track of domains that host malicious binaries.”

Still, moving on and assuming that the quality of the samples/validity of the hashes doesn’t affect Carbon Black’s two main hypotheses…

Does the hypothesis that 43 scanners are more ‘effective’ than a single scanner hold up to scrutiny? Well, I haven’t subjected it to statistical analysis myself, and obviously I’m not about to accept Carbon Black’s analysis uncritically and unequivocally. But it does make sense, subject to reservations about what ‘effective’ actually means. It certainly doesn’t mean loading 43 products onto a single system; I’m not sure it means acting immediately on a single company’s ahead-of-the-curve detection that turns out to be a false positive; and it doesn’t necessarily mean having detection for every one of the hundreds of thousands of unique binary samples that find their way into an AV lab on a daily basis. (By unique binary sample, I don’t mean a program totally discrete from all other malware: I mean a sample packaged so that it’s different to every other sample, not least in that it has a different hash value. The base code isn’t necessarily any different, but the use of packers and obfuscators may entail a different or modified detection algorithm.)
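
The ‘unique binary sample’ point is easy to demonstrate. In this sketch the payload bytes and packer stubs are invented stand-ins, but they show how identical base code wrapped differently yields completely different hashes – which is why sample counts balloon even when the underlying family doesn’t change.

```python
# Toy illustration (invented bytes, not real malware): the same 'base
# code' wrapped by two hypothetical packer stubs produces two distinct
# binaries with entirely different hash values.
import hashlib

base_code = b"\x90\x90BASE-LOGIC\xcc"                # stand-in for the shared payload
packed_a = b"PACKER-A" + base_code + b"\x00" * 16    # hypothetical packer stub A
packed_b = b"PACKER-B" + base_code + b"\x00" * 64    # hypothetical packer stub B

md5_a = hashlib.md5(packed_a).hexdigest()
md5_b = hashlib.md5(packed_b).hexdigest()
print(md5_a)
print(md5_b)
# Same base code, but two 'unique binary samples' by hash.
assert md5_a != md5_b
```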

Back in my NHS days, I was told by someone with far more influence on NHS security policy than I ever had that it didn’t matter which AV product(s) the NHS used, because they all detect much the same range of threats. (By all, I assume he was referring to the mainstream commercial products.) Well, yes and no. Disregarding the fact that the malware sample glut has increased dramatically since those days, it’s reasonable to assume that any one mainstream product will have access to and detection for day-zero samples that other products won’t have: someone has to see a specific malicious binary first. And because mainstream players (not only vendors but testers) have multiple channels for sharing information, hashes and binaries, detection of significant, active threats cascades through the industry. (In general, we don’t prioritize competitive advantage over the welfare of the community as a whole. Hopefully.) So detection of a high-prevalence or high-impact or high-profile threat (not all malware falls into all three categories, of course, but one is probably enough to ensure prompt sharing) can usually be seen, if we must use VT as a metric, to go from one or two scanners to double figures in a very few hours.

If you follow Carbon Black’s suggestion of ‘leveraging’ VT reports, you can probably get ahead of the curve, though that isn’t a risk-free strategy. (Short-term FPs, for instance.) However, what that wouldn’t get you is detection of “all malicious samples on day one!” Embarrassing though it may be for our marketing departments, Stuxnet and its siblings have proved pretty conclusively that the entire security industry can completely miss a significant threat for extended periods. It’s probably safe to assume that there are threats that are never detected by any product. That’s bad news, because there’s a false expectation that all badware will eventually be detected by AV – “If you were able to run all AVs together your systems would have been 100% secure”. (Though I’d say myself that if anyone really thinks that any security solution will give them 100% protection against all malware, let alone all security threats, I’ll be happy to sell them Tower Bridge and I’ll even throw in a couple of cathedrals.) But it’s not as bad as it sounds, because the assumption that all those binaries are equally significant is unfounded. Let’s take just a few scenarios:

  1. Email-borne malware that spreads very far, very fast. (Even then, its significance is partly dependent on its payload).
  2. A highly targeted threat that only ever appears on one site.
  3. A threat that spreads with a degree of promiscuity, but only triggers when it finds itself on one particular site (or in an environment with very specific characteristics…)
  4. A drive-by that serves a different morph every few minutes, or even every time it triggers. So there’ll be instances where it isn’t served at all, or served to a platform where it isn’t able to execute for one of many reasons (attempts a patched exploit, inappropriate OS or OS version, code bugs etc).
  5. Malware that spreads far and wide but can’t trigger (intendeds, corruptions).

If you compare these in terms of significance, you’re not just comparing apples and oranges. You’re comparing sheep, castor oil and ball bearings. The bottom line is this:

  • No security software will detect (or protect from) every malicious binary.
  • In an indeterminate number of cases, it doesn’t matter, since the malware doesn’t infect anything, or doesn’t or can’t trigger, or has no perceptible impact. There will be instances where undetected malware does have a significant impact – if it only happens to one system in the whole world, it’s still significant to the owner of that system – but the impact on the world as a whole is probably much less than the study suggests.
  • AV software is no more or less effective than it was before the CB studies. If people want to add Carbon Black to their armoury, fair enough (that isn’t an endorsement: I haven’t looked at that service, only at the CB analyses) – it may provide an extra layer of defence, though I wouldn’t advise using it as a substitute for AV (and nor, I think, would VT).

Let’s look at one more aspect of the CB study:

“Let’s assume that a single signature can detect 100 malware variants. If so, one would have to write 7,835 signatures per day just to handle the 783,561 malicious samples being reported. These signatures will accumulate over time, requiring an AV to check each newly created file against an ever-growing list of signatures, which dramatically slows a user’s machine down to a crawl.

As a result, AVs must keep their signatures small and relevant, perhaps needing to remove an old signature for each new one added. Although we can’t guarantee this to be the case, it’s certainly a valid hypothesis as to why certain AVs detected fewer of our samples on day 30 than on day 1.”

Modern products are indeed highly generic in their approach to detections (signatures, if you insist, though that term is misleading). In fact, many detection algorithms are capable of detecting far more than 100 variants and subvariants (we’re talking lots of zeroes here, in some cases…). But we don’t just add a detection for each processed sample; we modify a detection as necessary – and, of course, a good heuristic will sometimes detect many unknown samples without needing immediate modification. And sometimes a highly generic detection will be superseded for certain samples by a more specific detection, as more information is gathered on that particular threat family. But I doubt if any mainstream vendor pulls a detection within 30 days of an initial detection just to make room for another. If a vendor stops detecting something, it’s likely to be for another reason entirely: recognition of a false positive, reclassification (for instance as Possibly Unwanted), or even a process error.
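
As a toy illustration of why ‘one signature per 100 variants’ undersells a generic detection: a single pattern with wildcard regions can match an open-ended set of variants. The byte patterns below are invented, and a real engine’s detection algorithms are far more sophisticated than a regex, but the principle is the same.

```python
# Sketch only (invented byte patterns): one generic 'detection' matching
# a family marker with a wildcard region, rather than one fixed string
# per sample. Any variant keeping the fixed prologue and tail is caught,
# regardless of the variable 4-byte region in between.
import re

family_detection = re.compile(rb"\xeb\x10DECODE.{4}LOOP", re.DOTALL)

variant_1 = b"junk" + b"\xeb\x10DECODE\x01\x02\x03\x04LOOP" + b"more"
variant_2 = b"\xeb\x10DECODE\xff\xfe\xfd\xfcLOOP"
benign = b"hello world"

for sample in (variant_1, variant_2, benign):
    print(bool(family_detection.search(sample)))  # True, True, False
```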

I think I feel a paper coming on. ;-)

Author: David Harley, ESET Senior Research Fellow

  • Tommi

    Hi David,
    very interesting article, but I'd assume the link to owa… is not intended in the first quote "AV results vary based on configuration", is it?

    • David Harley

      Thanks, Tommi. Indeed it wasn’t. Moving to a different machine has turned up a few interesting quirks. :-/

  • Brian

    OK, so maybe VT results aren't an accurate way to measure detection.  From my observation, though, it seems that AV vendors are still having a very difficult time detecting "new" samples that they have not yet analyzed or developed a signature for.  AV vendors also seem to have a very difficult time detecting exploit files.
    I think this is troubling partly because of warnings and notices that come from AV vendors that might say something like "We detect threat-X in update yyzz" but what that really means is that they detect the samples that they have collected and for the most part, the malware just needs to be repacked or rebuilt to evade detection.
    I'm intrigued by your last paragraph that mentions detection algorithms and that the term signature may be misleading.  I feel that in spite of the capabilities of modern products, they tend to fail more often than they succeed when it comes to new samples.
    I understand that VT may not be the best tool to conduct research like this.  How do you think ESET and competitors would have performed in a more representative test though?

    • David Harley

      Well, by definition we can’t detect a sample we haven’t developed a signature/detection for. :) We do have generic/heuristic detections that detect some samples we’ve never seen, but we can never detect all the malicious binaries we’ve never seen. Or anything like all of them. That’s the name of the game I’m afraid: if a security vendor tells you that its product detects all known and unknown samples, shake your head pityingly and walk away.

      Exploits are a different ballgame. AV – well, some AV – often detects exploits. Especially Microsoft exploits, since MS is pretty good nowadays about sharing info about known vulnerabilities ahead of patches being available. Other vendors, obviously, are more problematical. However, I don’t think you can compare AV to a specialist vulnerability scanner. And if you’re thinking about a certain bungled test, think again. :)

      You’re right up to a point, though: when a vendor says ‘we detect threat-X’ they mean they detect all the samples they currently know of. They can’t promise to detect all obfuscated or tweaked samples of the same base code.

      The term signature is misleading (and always has been) because it encourages people to think in terms of static strings and/or static analysis. Actually, even those approaches are algorithmic: it’s just that looking for a static string isn’t a very sophisticated algorithm. :) Frankly, I don’t know what the overall proportion of detected to non-detected samples is for a given product, let alone for the whole gamut of AV products. And for the reasons discussed in the blog, it’s even harder to assess whether it really matters for a given example.

      VT is not suitable for research like this, and I don’t think this qualifies as a test. It gives me a headache thinking about how you could conduct a realistic longitudinal research project that would give you accurate data. I suspect that if you could adjust realistically for prevalence, appropriate classification, risk and impact and so on, products would tend to do better, but proving it would be an interesting challenge.

  • Brian

    "Well, by definition we can’t detect a sample we haven’t developed a signature/detection for." <- this is exactly the problem and the reason that people often complain about the effectiveness of anti-malware products.  Most malware that gets on to a machine (based on my experience) is "new".  Also in most cases, if it is detected at all, it will be after several days, not when the malware was dropped as I think many people expect.
    You sometimes blog about people questioning the usefulness of anti-malware software.  I think there is good reason to question it.  It doesn't detect exploits reliably, it doesn't detect new samples very well and many vendors say they detect and protect against a threat like Zeus or Poison Ivy but what they really mean is that they can only detect the samples that they have collected, which is somewhat misleading.  On the other side of the argument are AV software advocates that might suggest that because it is sometimes successful, it is worth having (and investing time and money into).
    "I don’t think you can compare AV to a specialist vulnerability scanner." <- I don't expect an anti-malware solution to detect vulnerabilities but I think it is reasonable to expect it to detect malicious files like pdf, office and java exploits.  I don't think it is necessary to have vulnerability details from the vendor of the vulnerable software – it would be far easier to collect samples of the exploits and develop detection that way (which is how I guess most vendors do it).
    So the file-based scanning that VT uses may not be entirely representative.  How different would the results be if someone were to conduct the same type of test with a bunch of Windows desktops running full versions of the EPP software?  Based on your response that you can't detect a sample that you don't have a detection signature for, I'm inclined to believe that the results would be pretty similar (or at least not substantially different).  Maybe some solutions might have better heuristic detection but I doubt that would be typical in a default configuration. 
    Interesting challenge for sure.  I think it would be worth investing some time into this as well.  Thanks a lot for your thoughtful response.

    • David Harley

      I’m afraid you misunderstand me. What I’m saying is not that we can’t detect anything we haven’t already seen, but that we can’t detect what isn’t covered by an existing detection. An indeterminate percentage of unknown malware is detected by existing signatures. As for reliability, you’re preaching to the converted. I’ve never said that AV is anything like 100% reliable, even in the days when malware was much rarer and detection rates were much higher. The problem is that I have yet to see an alternative that is 100% reliable, or anything like, without hampering business processes. I do think that AV is a useful option as part of a multilayered defence strategy in the enterprise. (Home users may find it more convenient to use an internet security suite, though they can also mix and match components if they know what they’re doing.) AV isn’t the only option, and it’s a long way from being complete protection, but if you’re going to give up on it, you need to suggest a viable – in fact, better – alternative.

      In fact, vulnerability details from a vendor are a more reliable source of data than samples picked up in the field, and there’s a channel of communication between MS and the mainstream security industry for that purpose. Of course we can and do base detections on exploitative malware found in the wild, but that’s haphazard and can be resource-intensive.

      How would EPP make the samples more representative? (Achieving that is the whole problem with AV testing in a nutshell, and I have yet to see a fully satisfactory methodology that works in a 21st century threatscape.) Aiming to replicate the approach without addressing that is a blind alley.

      And I’m not sure that your view of what constitutes heuristics and mine coincide. The days when heuristic scanning was cautious and strictly optional because we were terrified of FPs are pretty much behind us. Cloud-based detection is heavily reliant on generics. Not that FPs aren’t still a problem…


Copyright © 2017 ESET, All Rights Reserved.