Generalist Anti-Malware Product Testing

We have just come across a Buyer’s Guide published in the March 2010 issue of PC Pro Magazine, authored by Darien Graham-Smith, PC Pro’s Technical Editor. The article aims to advise consumers on which anti-malware product is best, and we acknowledge that it includes some good thoughts and advice. However, it also contains several significant methodological flaws: in fact, we were a little taken aback at some of the testing methodologies used. It seems that all the testing was performed exclusively in-house, and we think that if it had been conducted by a specialist testing organization with years of experience focused primarily on objective anti-malware testing, the results might well have been very different, and more convincing. We would like to respectfully point out some of the problematic assumptions and methods used in the March issue.

The methodology used to test the products’ detection capability, that is, their ability to protect against threats, was flawed. As an example, we can quote the article:

“Every file has been positively identified as dangerous by at least four packages, so a good suite should detect most of them.”

This seems reasonable, right? But wait: there was no direct validation of whether the samples actually constitute malware, i.e. no independent confirmation that they are malicious rather than innocent.

There are at least two false assumptions here. The first is that you can validate samples accurately simply by running them past one or more scanners and seeing whether they are detected. Mr. Graham-Smith is correct in thinking that he reduces the risk of false positives by requiring at least four scanners to identify each sample as malicious; however, he doesn’t eliminate it. It’s by no means unknown for an incorrect detection to cascade from one vendor to others if those vendors don’t re-validate it. As more vendors move towards an “In the Cloud” model of detection by reputation, driven by the need to accelerate processing speed, it’s easy for a false positive to spread, at least in the short term. At least some of the files could have entered the test set from a database provided by one or more of the vendors and subsequently been falsely detected by heuristics as malicious.
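
As a purely illustrative aside, the following sketch (with entirely hypothetical scanner names and verdicts) shows what a “detected by at least four packages” rule amounts to: it measures agreement between products, not whether a sample is genuinely malicious, so a cascaded false positive can still pass the check.

    # A minimal sketch, with hypothetical scanner names and verdicts, of the
    # "flagged by at least four packages" consensus rule quoted above. The rule
    # only measures agreement between scanners; if one vendor's false positive
    # has been cascaded to others, the threshold is still met.

    MIN_DETECTIONS = 4  # the threshold used in the PC Pro test

    # Hypothetical verdicts: sample identifier -> set of scanners that flagged it
    verdicts = {
        "sample-001": {"ScannerA", "ScannerB", "ScannerC", "ScannerD", "ScannerE"},
        "sample-002": {"ScannerA", "ScannerB"},  # below the threshold
        "sample-003": {"ScannerB", "ScannerC", "ScannerD", "ScannerF"},
    }

    def passes_consensus(flagged_by, threshold=MIN_DETECTIONS):
        """Return True if enough scanners flagged the sample (agreement, not proof)."""
        return len(flagged_by) >= threshold

    accepted = [name for name, scanners in verdicts.items() if passes_consensus(scanners)]
    print("%d of %d samples meet the consensus rule" % (len(accepted), len(verdicts)))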

However, there’s an even greater problem.

When a detection test uses default installation and configuration options, as was done in this test, it’s particularly important that samples are not only correctly identified, but also correctly classified. This is because not all scanners treat all classifications of malware in the same way. While all scanners take similar approaches to out-and-out malicious programs such as worms, viruses, bots, banking Trojans and so on, there are other types of application, such as some examples of adware, that can’t be described as unequivocally malicious.

Similarly, some legitimate programs may use utilities such as packers and obfuscators, and it’s not appropriate to assume that all anti-malware products treat such programs in the same way. Some products assume that all packed or obfuscated programs are malicious, while others discriminate on the basis of the code that’s present, not just on the presence of a packer. These “grey” applications and ambiguous cases may be classified as “Possibly Unwanted”, “Potentially Unsafe”, or even “Suspicious”.
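
To illustrate why default settings matter here (using invented product names and a simplified policy table, not any vendor’s actual behaviour), consider the following short sketch of how two products might map the same “grey” classification to different out-of-the-box actions:

    # Purely illustrative: two hypothetical products, the same classifications,
    # different default actions. Category names and actions are assumptions.

    DEFAULT_POLICY = {
        "ProductA": {
            "malware": "block",
            "potentially_unwanted": "block",
            "potentially_unsafe": "block",
        },
        "ProductB": {
            "malware": "block",
            "potentially_unwanted": "ask_user",  # prompts rather than detecting silently
            "potentially_unsafe": "allow",       # detection of this class off by default
        },
    }

    def default_outcome(product, classification):
        """Return what the product does out of the box for a given classification."""
        return DEFAULT_POLICY[product].get(classification, "allow")

    # A "grey" sample such as a commercial keylogger or remote-admin tool:
    for product in sorted(DEFAULT_POLICY):
        print(product, "->", default_outcome(product, "potentially_unsafe"))

    # In a default-settings test, ProductA appears to "detect" the sample and
    # ProductB appears to "miss" it, even though both classify it identically.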

Unlike many professional testing organizations, PC Pro does not consult with vendors about issues such as configuration before a test, and it does not give “missed” samples from its tests to the publishers of the products it tests. However, and to his credit, Darien Graham-Smith quickly responded to a request for further information with a list of file hash values for the samples he says we missed (18 out of 233) and, in all cases but one, the detection name assigned to each by one of our competitors. (A file hash such as an MD5 uses a cryptographic function to compute a value for a file that is unique to that file, at least in principle. In fact, it is possible, though very rare, for two files to have the same hash value; we call this a hash collision.) This enabled us to check our own collection for files corresponding to the sample set used by PC Pro.
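
For readers curious about the mechanics, the following sketch shows the kind of cross-referencing this makes possible; the hash values and directory name are hypothetical placeholders, not the list PC Pro actually supplied.

    # Hypothetical sketch: match a list of reported MD5 hashes against a local
    # sample collection. The hashes and the directory below are placeholders.

    import hashlib
    from pathlib import Path

    reported_missed = {
        "d41d8cd98f00b204e9800998ecf8427e",   # placeholder values only
        "9e107d9d372bb6826bd81d3542a419d6",
    }

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    collection_dir = Path("sample_collection")   # hypothetical local collection
    matches = {}
    if collection_dir.is_dir():
        for p in collection_dir.rglob("*"):
            if p.is_file():
                h = md5_of(p)
                if h in reported_missed:
                    matches[h] = p

    for h, p in sorted(matches.items()):
        print(h, "->", p)
    print("Matched %d of %d reported hashes" % (len(matches), len(reported_missed)))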

When checking the samples that the magazine claims we missed, we found some anomalies in the sample set. The random nature of the sample selection (including such oddities as a Symbian Trojan, an anomalous file version of a 1989 boot sector virus, packer detections, a detection of a damaged sample, and a commercial keylogger) gives serious cause for concern. We even found samples detected by some of our competitors under names like “not-a-virus: RemoteAdmin.PoisonIvy”. With fuzzy classifications like these, it’s unsurprising that many of these cases are not detected by default by all scanners. But where such samples are used, as was the case here, the accuracy of the test is compromised, since it introduces a bias in favour of products that don’t discriminate between possibly malicious and unequivocally malicious applications.

The Anti-Malware Testing Standards Organization (AMTSO – http://www.amtso.org) was established in May 2008 with the express intention of reducing unprofessional testing, skewed methodologies and the flawed results they produce. Its status is strictly that of an international non-profit association focused on addressing the universal need for improvement in the objectivity, quality and relevance of anti-malware testing. Principle 5 of the AMTSO document “Fundamental Principles of Testing” (http://www.amtso.org/amtso—download—amtso-fundamental-principles-of-testing.html) states:

Testers must take reasonable care to validate whether test samples or test cases have been accurately classified as malicious, innocent or invalid.

It has often been the case in the world of Antivirus testing that seemingly reliable testing results were, in fact, not valid, because the samples used in the tests were misclassified. For example, if a tester determines that a product has a high rate of false positives, that result could be inaccurate if some samples were wrongly classified as innocent. Thus, it is our position that reasonable care must be taken to properly categorize test samples or test cases, and we especially encourage testers to revalidate test samples or test cases that appear to have caused false negative or false positive results.

Similarly, care should be taken to identify samples that are corrupted, non-viable or that may only be malicious in certain environments and conditions.

Yet another question that arises with regard to PC Pro’s testing methodology concerns the small sample size used in the test (233 files) and how those files were obtained. Since PC Pro’s validation of the test samples did not meet professional standards, no authoritative conclusions can be drawn from this test as far as the products’ detection is concerned.

The other detection testing method used by PC Pro was a dynamic test of web threats. Dynamic testing of infected websites is, to say the least, methodologically very problematic (http://www.amtso.org/amtso—download—amtso-best-practices-for-dynamic-testing.html). We quote PC Pro to illustrate this:

“For this month’s web-based test, we visited several hundred dodgy-looking websites. We identified 54 of them as potentially malicious, because those pages caused at least one security product to throw up an alert.”

This is problematic, in that it introduces an immediate bias: the validity of a single product’s alert is assumed without question.

It also has to be said that the web changes constantly, which means that web-hosted threats also change. So a question arises: did the tester use 15 parallel computers, so that every machine and every solution was tested against a single site, serving the same malware, at exactly the same moment? Only if this principle was upheld can consistent results be ensured for each tested product.

The method used here seems very questionable: malware hosted on the web may change at very short intervals, and so may be different every time it’s accessed. Moreover, the tester has failed to validate the websites as genuinely malicious, and yet he goes ahead and draws conclusions about the performance of the tested products based on these questionable parameters. In the methodology used, the author fails to identify which websites are dangerous, which are harmless, and which are simply offline.
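
By way of illustration only, here is a sketch of one way a tester might enforce the consistency requirement described above: each test machine records a SHA-256 digest of the payload it was actually served, and the harness compares the digests before treating the results as comparable. The machine names and digests below are invented.

    # Illustrative only: compare the payload digests recorded by each test
    # machine for the same URL at the same moment. Names and digests are invented.

    import hashlib
    import urllib.request

    def payload_digest(url, timeout=15):
        """What each test machine would run: fetch the URL and hash the response body."""
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return hashlib.sha256(response.read()).hexdigest()

    # Digests reported back by each machine for one test URL (hypothetical values)
    recorded = {
        "machine-01": "5f2a9c0d11",
        "machine-02": "5f2a9c0d11",
        "machine-03": "b9134e77a2",  # a different payload was served
    }

    if len(set(recorded.values())) > 1:
        print("Warning: machines received different payloads;"
              " detection results for this URL are not directly comparable.")
    else:
        print("All machines saw the same payload; results can be compared.")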

We will shortly address problems with the test’s methodology as regards aspects of product performance other than raw detection in another blog post. We have also asked Pierre-Marc Bureau and David Harley for more information on their expert analyses of the sample set used.

Ján Vrabec
Security Technology Analyst, ESET

Author David Harley, ESET

  • Hello, Yegor. You didn’t really think we’d let you advertise Dr. Web here, did you?

  • Yegor

    OK, I'm sorry. So what about cleaning? ESET's cleaning technology badly needs to be improved, don't you think? Why so bad (av-comparatives.org/comparativesreviews/removal-tests)?

    • I’m aware of that report: in fact, I had an interesting discussion with Andreas Clementi about cleaning a few months ago. It’s actually a test area that merits closer inspection, but I don’t really have time or space to discuss it here and now. I think you’re reading a bit too much into that report. There isn’t generally much differentiation between the good and average performance bands, and I’m not convinced that the size of the sample set and the evaluation of the cleaning criteria allow the reader to draw significant conclusions in most cases.

      I’d agree in principle that AV shouldn’t leave remnants, but the fact is that as detection becomes more generic/heuristic, it becomes more difficult to guarantee removal of every last remnant for every one of the tens of thousands of unique binaries that turn up on a daily basis. Up to the turn of the century (admittedly, long before I joined an AV company) I was pretty scathing about the issue, but the threatscape was very different then. Now, I’m far less concerned _unless_ remnants actually cause visible problems after disinfection. I’m certainly not about to panic about an “advanced” score based on a set of ten samples.

  • Yegor

    What about this (anti-malware-test.com/?q=node/180)?

  • Randy Abrams

    A 16-sample test is hardly a comprehensive test. With such a small set you can manipulate the test to get whatever results you like for any product.

  • Yegor
    • Randy Abrams

      Credit to them for publishing their methodology. Still, 16 is a very small sample set, and complexity is not an indicator of prevalence either. Additionally, the test was done with the AV loaded onto an already infected system, so it does not indicate whether users of the tested products would have avoided infection when they encountered the threat.

  • Yegor

    Randy, the AVC removal test and the AM active infections treatment test IV have shown poor cleaning capabilities of NOD32. Therefore I, an ESET fan, would like to know whether you plan to improve the cleaning technology. Just answer “Yes” or “No”.

    • Randy Abrams

      Assuming the test was done properly, it shows problems with a small set of samples. This is not a large enough test to draw conclusions about the overall comparative cleaning abilities of different products. And yes, our developers are constantly working on improving the cleaning capabilities. Do note, however, that if your computer is infected with a rootkit and security is important to you, then you’ll be wiping the hard drive and reinstalling the OS. Detection of a rootkit does not assure detection of anything else that may have been installed while the machine was compromised.

  • Yegor

    The rootkits in the test are dangerous malware, not to mention the fact that there are hundreds of modifications with the ability to block the OS and kill any security software. And the test shows problems with widespread samples.
    P.S. Well, good luck in developing the cleaning ability and, of course, HIPS.

    • Randy Abrams

      Also not mentioned was whether they used the feature of NOD32 and Smart Security that makes a bootable CD to scan and clean with the rootkit inactive.

  • Yegor

    By the way, can (or will) ESET SysRescue delete registry entries of ? malware? What are your corporate customers to do if there are thousands of infected machines with ESET AV installed?

  • Randy Abrams

    Hi Yegor,
    I'm not sure what registry entries may or may not be deleted with SysRescue. Registry entries themselves are generally pretty harmless, except that occasionally they inoculate against a threat, and sometimes they are changed by malware and cause systems not to execute some commands properly. Running SysRescue should delete any detected malware, as the malware will no longer be able to hook the system… it is inert at this point.
    Customers with thousands of machines that are infected would contact their support representatives for assistance.
     
