If you were hoping to find a sensational story outing one of our competitors, I am going to disappoint you right away. This is not that, but it is something, something that can happen to all of us!

Spam is something that everyone encounters.

Spammers are constantly looking for new ways to get their garbage to you, bypassing your spam filters.

So far, this is no different from any other cat-and-mouse game in cybersecurity. Although there are some extremely good antispam solutions available, even the best are bypassed from time to time, and will need to adapt their rule sets to guard against the latest spam techniques.

As with antimalware products, antispam products are included in comparative tests conducted by several independent and objective testers. Testing is a lucrative business for the testing organizations, and expensive for the security vendors being tested, so it should come as no surprise that these vendors want to achieve the best results.

That by itself creates a new marketing model: commercial vendors trying to sell feeds of spam samples to both testers and security vendors. One could argue that antispam vendors willing to buy the feeds (or freely consuming a feed if the supplier wants, for commercial or other reasons, to have its feed look important) will have an unfair advantage with those testing their products, but that is not the scope of this blogpost.

Recently ESET was confronted with a tester that started to consume a new commercial spam feed to complement its existing antispam test bed.

When I and other ESET researchers started to analyze that feed, we were astonished. Not only because the samples in the commercial spam feed were not classified (who decides what is ham or spam? Can you read all languages to determine that?), but also by the high noise ratio – there were many legitimate messages – that is, they were “ham”, not spam! On top of that, when analyzing those ham messages, we found many with personal (and personally identifiable) as well as confidential information: (personal) pictures, copies of driver licenses, credit card information, and so forth. How did these legitimate emails end up in a “spam” feed?

The key here is Parked Domains and Sinkholed Domains. At a basic level, the latter are domains typically under the control of anti-DDoS services, law enforcement or researchers so their operators can alleviate or monitor nefarious or malicious activity, usually by directing (some) network traffic for these domains into the bit bucket or to systems under their control. Parked Domains are domains that people register, usually for far-from-legitimate purposes, with domain names that give the user the idea that they are going to a supposedly legitimate site, e.g. my-bank-new-card[.]com. Or the domains look a lot like legitimate domains but are a typo away from them, often referenced as Typosquatted Domains, as e.g. oulook[.]com instead of outlook[.]com. Sometimes, as in fraud cases, this is done so, for example, phishing spam with apparently legitimate URLs can be sent; in other situations, to collect emails/data from people who make a typo in an email address. Such scams are usually very short-lived, so the criminals behind them register these domains for just 12 months (the usual minimum) and do not renew their registration. Shortly after registration lapses, anyone can then (re)register such a domain name, install an email server for it, and start collecting all email sent to the domain, both spam and legitimate email messages intended for the correct, original domain.

The vendor of the aforementioned spam feed collects all emails sent to Parked and Sinkholed Domains and supplies the emails to security vendors and testers.

Of course, no one can prevent people from sending emails with private, confidential information to the wrong email address, such as one on a typosquatted domain, other than the senders themselves. To be honest, you cannot even really blame anyone for doing this, as they intended to send that information to the right address… and they probably thought it was sent correctly since they did not receive a bounce message!

However, the ethics of selling a spam feed that includes such messages “as spam” is dubious as those messages clearly are not all spam.

What is spam? A common definition is bulk, unsolicited email that is usually commercial in nature.  Bulk – say no more, these messages are not sent in bulk. Almost all are certainly ever sent only once and to only one address (well, maybe two addresses – the original and the correct ones once/if the sender realizes the mistake). Yes, it’s unsolicited in a sense, but that alone does not make it spam, and arguably anyone setting up mail servers on such domains to collect all possible received email is only doing so because they are soliciting for exactly such email – that is, they want to receive these messages, so they are not really unwanted or unsolicited.

Beside this technical issue of these messages simply not being spam, it creates an ethical and moral issue. The owners of these parked domains have surely not obtained the consent of the original senders to use or sell their email messages - certainly not those with the private, confidential information - for this purpose. Insofar as these feeds include any such messages sent by EU residents, providing such a feed with the absence of key elements of data processing “lawfulness, fairness and transparency” seems likely to be a violation of GDPR. We are also curious about the compliance with other principles relating to processing of personal data, such as purpose and storage limitation as well as confidentiality of data being included in privacy and data retention policies of this feed vendor.

When a tester uses such a feed as a part of its test bed, the problem is exacerbated.

For validation purposes, bona fide testers supply “misses” to the producers of the software they test, in order to confirm the misses. At that moment, antispam developers will receive (and thus store) the missed “spam” samples. Without proper legal grounds, any activity other than deletion and notification to the tester and feed vendor might lead to GDPR violation, regardless of the location of the storage or offices of the product developer.

Further, these “missed” samples may cause issues for a vendor’s antispam product. Machine learning (ML) algorithms are widely used in antispam products, and adding such legitimate messages to your “spam” set is likely to make any ML-based classifications of previously unseen email messages less accurate, thus putting customers of antispam products at greater risk. Storing this kind of data is actually not something we want. Upon making this discovery, ESET deleted from our spam database all samples sourced from this feed.

ESET, of course, contacted the tester, who quickly and correctly removed the feed from the then-current test, while investigating our findings. Later the tester informed us that the feed had been investigated, our findings confirmed, and that it had discarded this test-feed completely.

The provider of the commercial feed also has been contacted; at the time of publication no reply had been received.

All security advice aside, there is no remedy to prevent these kinds of data leakage other than common sense: verify the email address twice and then twice more before you send any sensitive data to it. Not just to be certain that you did not make a typo in it, but also to ensure that the email address is still in use by the organization to which you are sending the data. Tools such as 2FA, a password vault, and so on are all useless in this scenario because for all their ability to protect your identity, they cannot protect you from sending email to the wrong address.