How to Screw Up and Skew a Test

Even as AMTSO attempts to bring some qualified and competent guidance to testing methodologies, and individuals with an agenda or paranoia invent stories about why it is not good, we see more completely incompetent testing.

I refer this time to the test that Steve Ragan wrote about at The test performed by Cyveillance, who describe themselves as cyber intelligence company, is a text book example of how to do antimalware testing fundamentally wrong.

To begin with a sample set of 1,708 UNCONFIRMED malicious files were used. We see around 200,000 unique new samples each day at ESET alone. 1,708 is not a statistically significant sample set. Cyveillance claims to have confirmed the samples, but their methodology proves otherwise. The way you confirm a sample is malicious is by examining what the sample actually does. Cyveillance discarded any samples that were not detected by 3 antivirus companies. There are two interesting aspects to this approach. For one, if a harmless file is detected by 3 scanners (false positive) and, correctly, not detected by the rest of the scanners, then the scanners with the false positives are rewarded and the scanners with accurate detection are penalized. All of the scanners would have said the EICAR test file was infected, but the EICAR test file is not malware at all. Many scanners will detect harmless files because they are encrypted using a method that malware often, but not exclusively, uses. Detection by three scanners is not a reliable way to determine if a file is malicious, just as a lack of detection by at least 3 scanners is not a reliable way to determine a file is not malicious. There have been numerous cases where far more than 3 antivirus products on VirusTotal  have had false positives and numerous times that less than 3 scanners on VirusTotal correctly identified a real threat! Concluding the validity of a sample, based upon the results of 3 scanners, is one of the most amateur mistakes any tester ever makes.

Discarding the samples that only one or two scanners detect can seriously skew the detection rates of the one or two scanners than did identify a threat correctly. In essence if a product detected 1,000 out of 2,000 samples (50%), but 500 samples were discarded because only that scanner and one other scanner detected it, then the 50% detection rate is incorrectly reported as 25%. This is only for one product, but what about the other products that may have detected dozens of discarded samples? The numbers across the board would be skewed, that’s what happens. The completely amateur and mistaken methodology for determining sample inclusion makes their test highly unreliable.

What if Cyveillance had used scientifically sound methodology? In that case the test would have been an interesting report as to the effectiveness a microcosm of on-demand scanning effectiveness, but still not a good snapshot of the big picture. Remember at ESET alone, we are discovering 200,000 unique new threats each day. That is 6 million threats in a 30 day period and does not include the threats our competitors find that we do not detect. 1708 samples is less than 3/100ths of a percent of the new malicious samples we see in a 30 day period at ESET alone. This is a very small sample set and is also a onetime snapshot that cannot warrant extrapolation to a norm or average.

An interesting sentence in the report struck me a naïve. “If the malware writer can practically ensure the malware will go unrecognized upon release, then this means that a critical question of each security suite’s performance is the speed at which it “catches up” with the latest threats”. IF? IF? What are they thinking? The bad guys have access to all of our product. Yes, some of them definitely can and do test to make sure that their malware is completely undetected by on demand scanning at the time of release. There is no question about that, but on demand scanning is only a part of the security defenses that an antimalware product has in its arsenal, so this type of testing does not give a real world picture of protection except for those users who are using only on-demand scans. In the real world the heuristic algorithms and differences in settings than may be used in real time scanning versus on-demand, as well as other components of a security product may very well have prevented the sample from getting onto the computer in the first place. That is why quality, dynamic testing is required to give a more comprehensive picture of protection. Quality dynamic testing is still not available from any organization that both uses a statistically valid sample set and exposes their methodology and samples so that their work can be scientifically validated. NSS labs, a critic of AMTSO, shuns scientific scrutiny of their dynamic testing, the implication being that their methodology would fall like a rock.

Steve Ragan also reaches the mostly correct conclusion that “If their report figures show anything, it is that signature-based detection alone isn’t enough, and keeping current on security software updates is the best defense a system can have.” Keeping current on security software updates is an important defense, but the best is a multilayer approach that also includes education.  Let’s imagine for a moment that Cyveillance is way off. Let’s imagine that actual zero-day detection is on average 75%. This would still show that signature-based detection alone isn’t enough. Honestly, I don’t believe that signature based zero-day detection is anywhere close to 75%. It is very possible that for zero-day detection the average is much lower than the 19% average reported, or somewhat higher, but the immense flaws in the methodology of this test offer no insight into the real world.

So, if the Cyveillance test does not support Ragan’s conclusion, how can I say his conclusion concerning signature-based detection is correct? The combined experience of the support centers for the antimalware companies throughout the world support Ragan’s conclusions with real world (yet generally proprietary) numbers. Oh, believe me, AV companies would be far more profitable if they did not have to support so many infections due to missed malware! Not only is signature-based detection alone not enough, but even adding in heuristics is not enough. Still, antimalware is a tool that users currently have available to help mitigate risk.

AMTSO is attempting to establish guidelines and information that helps people to be able to avoid these common and obvious mistakes. None of the critics of AMTSO I have seen have ever actually attacked any of the information or guidelines that AMTSO publishes, rather they rely exclusively on innuendo and paranoia. In other words, they will not directly confront the work of AMTSO. From what I have seen, NSS Labs, owned by former ESET Marketing executive Rick Moy, is a marketing, rather than science driven organization. Most of the attacks against AMTSO have been based on the fact that there are employees of antivirus companies in AMTSO. The actual output of AMTSO has been something that NSS Labs have left undisputed. In attacking AMTSO NSS Labs rarely if ever acknowledges the membership of antivirus test organizations who also want better testing standards.

Kevin Townsend is another critic of AMTSO. Townsend, in addition to showing the paranoia about vendors, laments that AV product users are not involved in AMTSO. I could do a whole blog about why users are not qualified to be involved ( but should not be excluded and are not excluded). Seriously, if a professional security organization such as Cyveillance can make such obviously novice errors, how do you expect the average user to have knowledge to contribute to the establishment of scientific standards? It would be great if Cyveillance joined AMTSO, or at least read some of the documentation on the site.

Cyveillance has a conclusion section of their report. Pretty much all of it but the first paragraph is spot on. The first paragraph reads:

As illustrated in the preceding section, the average time it takes for the top AV solutions to
catch up to new malware threats ranges from 2 to 27 days. Given that this metric does not
include malware files still undetected even after thirty days, the AV industry has much room
for improvement.

It is definitely true that the AV industry has much room for improvement, but the flaws in the testing methodology mean that the Cyveillance test has enormous room for inaccuracy and is not a scientific illustration of the problem. Do read the rest of their conclusion as well though, there is some really sound advice despite invalidity of their test.

Not to knock Cyveillance, they may very well be good at many things, but right now, scientific AV testing is a seriously weak point for the company.

Randy Abrams
Director of Technical Education

Author , ESET

  • BDJ

    While the methodology does appear to be flawed, this statement you made is incorrect: "1,708 is not a statistically significant sample set"  You can use any number of statistics references to find that 1,708 is an excellent statistical sample size for a set of 6,000,000 (the number you referenced later for new items in a 30 day period).  It provides a margin of error of just over 3% with a confidence level of 99%.

  • Securater

    Who has heard of Cyveillance? Have they provided any proof or pedegree that they have the background and expertise to conduct valid scientific tests? Therein lies my first clue as to the validity of the results. I agree that we need AMTSO standards. But these new standards should consider other techniques & solutions. 
    Behavioral analysis, host intrusion prevention systems, whitelisting, sandboxing, intrusion detection and squid filtering of HTTP traffic are hardly considered these days in the current comparative tests. These types of technologies should be considered because there is much more to detection in the real world and in real time than just Signatures and Heuristics.
    By publishing flawed tests, Cyveillance has only added to the confusion surrounding computer security. What we need are real world standards and reliable people speaking the truth. Good work!

  • Randy Abrams

    BDJ, I believe the statistical modeling you use to asses a margin of error of 3% with a confidence level of 99% is based upon finite population statistics, which a professional statistician has told me is not appropriate for this type of testing. Additionally, statistical reliability estimates presume reliable polling (sample collection) methodology which is clearly not the case here.

  • Derek Foster

    I just read a similar article on regarding testing by NSS Labs which didn't place ESET Smart Security in a very favorable light.  Have you had a chance to review that report for accuracy?
    The article can be found here:

    • Randy Abrams

      NSS Labs wants people to believe they are independent. They carefully state that they don’t take money from vendors for testing. NSS Labs eagerly takes money from AV vendors, they simply say it isn’t for testing that they are taking the money.bESET doesn’t give NSS Labs money.

      NSS labs also does not share enough information for any 3rd party to replicate their tests, they don’t want people to be able to show the flaws. NSS Labs is not at all transparent. The owner is a marketing professional, not an IT person.

  • Tua

    what d u think about the NSS Lab Report?

    • David Harley

      As far as I know, the situation with NSS hasn’t changed since Randy’s observation in the article.

Follow us

Copyright © 2017 ESET, All Rights Reserved.