Malware coded into synthetic genomes has caused skepticism

When I began researching this topic towards the end of 2013, I sensed a certain skepticism from the scientific community, particularly when people with different backgrounds started experimenting across disciplines, an approach that can reveal new attack vectors in IT security.

In late 2015, when I presented my Master's thesis in IT security on Malware that infects genomes*, I experienced that skepticism up close. During the review process, one of the professors, a specialist in molecular biology, branded it “erudite nonsense.” In his opinion, it was obvious that a DNA sequence could be modified for malicious purposes, and it was the researcher's duty to verify that what was sequenced matched the originally published sequence. I do not disagree with this point of view, but beyond the many security scenarios that open up, it is difficult to convey how easily some of those checks could fail, particularly if the problem lies in the software. The simple fact that this could occur warranted further study, in my opinion.

Nevertheless, his perspective was not without grounds. My biological scenarios were merely theoretical, given that I did not have the resources to synthesize/sequence a modified genome and demonstrate a real case. Without this, it was difficult to verify the feasibility of a genome being compromised with malicious information in such a way that, if synthesized, it could be passed into the biological realm, carrying an arbitrary sequence, and then be sequenced and compromise the system. Furthermore, it wasn't something we could see ‘in-the-wild’, but technically that didn't mean it couldn't happen one day.

And then, that day came.

Professor Tadayoshi Kohno and his team from the University of Washington managed to demonstrate it in their article published last week: “Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More.”

Kohno and his team carried out in-depth, detailed research into the subject and put into practice the theoretical scenario I had been wondering about too: “maliciously” modified DNA can be synthesized and sequenced, giving rise to the execution of arbitrary code. In this case, they introduced a vulnerability into an application called fqzcomp to demonstrate the code's execution.

However, there are many different possibilities. In my work, for example, there was a simple script that parsed the FASTA file (which contains the genome's information and is written using the four nucleotides: adenine, cytosine, thymine, and guanine) to decrypt and execute the “payload.” It wasn't an elegant solution, and it required the victim to be vulnerable in order for the script to run, so I wasn't fully satisfied, but it did the job. To encode the string into the sequence, the procedure was similar to the biological process, whereby these four bases (A, C, T, and G) are grouped into triplets called codons (which represent amino acids and are then translated into proteins).

This means you can take the groups of three as a basis and assign a symbol to each triplet, forming a “hidden” alphabet. In this case, ASCII characters were used, and the coding took the following form: ACA = “A”, ACC = “B”, ACG = “C”, and so on (there are various ways to code the message; this is just one example). Since there are 4^3 = 64 possible triplets, the 26 uppercase letters, 26 lowercase letters, and 10 digits (62 symbols in total) fit easily, with two triplets left over for symbols. This system offers a way to “write” arbitrary code inside a genome. Naturally, you could write quotes, as J. Craig Venter did when he created a cell controlled by a synthesized genome, or inject malware or arbitrary code.
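
The triplet coding described above can be sketched in a few lines of Python. This is a minimal illustration, not the script from the thesis. Only the first three assignments (ACA, ACC, ACG) come from the text. The ordering of the remaining triplets, the 64-symbol alphabet (letters, digits, space, and period), and the function names encode and decode are assumptions made for the example.

```python
from itertools import product
import string

BASES = "ACGT"

# All 4^3 = 64 triplets in lexicographic order: AAA, AAC, AAG, ..., TTT.
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]

# Assumption: rotate the list so it starts at ACA, which reproduces the
# three assignments given in the text (ACA = A, ACC = B, ACG = C).
start = TRIPLETS.index("ACA")
ORDERED = TRIPLETS[start:] + TRIPLETS[:start]

# Assumption: a 64-symbol alphabet of 26 + 26 + 10 characters plus two
# extra symbols (space and period) to use up the remaining triplets.
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + string.digits + " ."
assert len(SYMBOLS) == len(ORDERED) == 64

ENCODE = dict(zip(SYMBOLS, ORDERED))
DECODE = {triplet: symbol for symbol, triplet in ENCODE.items()}


def encode(text):
    """Turn a string into a nucleotide sequence, one triplet per character."""
    return "".join(ENCODE[ch] for ch in text)


def decode(sequence):
    """Turn a nucleotide sequence back into text, reading three bases at a time."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(DECODE[sequence[i:i + 3]] for i in range(0, len(sequence), 3))


hidden = encode("Hello World.")
print(hidden)
print(decode(hidden))
```

Under this table every possible triplet decodes to some character, so any stretch of DNA whose length is a multiple of three "decodes" to a string. That hints at why telling a hidden message apart from ordinary sequence data is not straightforward.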

What kind of impact could this cause?

Below is a portion of my thesis that analyzes the potential scenarios.

"The impact of this type of attack could be classified as: digital, digital-biological, and biological.

Digital impact: The fact that a malicious payload can be injected into a DNA sequence does not make the infection itself worse; rather, it makes the payload harder to identify and subsequently detect with traditional protection methods, such as hashes to ensure integrity and solutions to detect corrupted files. For this reason, it has been demonstrated how this scenario would work in order to warn of the possible use of genome sequences as alternative vectors.
Digital-biological impact: In the event that a genome sequence is maliciously modified, and that genome is successfully synthesized, the malicious code could remain in the cell without impacting it. It should be clarified that this was not verified by the author, as it falls outside the objectives of this work. If this were to happen, the organism would carry the malicious code in its DNA, which could then be sequenced in a laboratory to produce a sequence file containing, for example, a portion of malicious code. An attacker would then just need to extract it and execute it in order to activate a digital attack. (This point is similar to the one demonstrated by the University of Washington.)
Biological impact: This would be the case where a maliciously inclined person has the ability to cause a mutation in a sequence, which would have no malicious impact on the system but could set in motion a functional problem at the biological level, if it were synthesized without adequate checkpoints. (This would be a hypothetical case whose feasibility is more difficult to verify.)"

As we saw with Professor Kohno's publication last week, the second scenario (digital-biological impact) has already been addressed and demonstrated to be “feasible” under certain circumstances. Undoubtedly, it remains far from being a real threat, but it is no longer the merely theoretical problem we imagined in the past.

In the future, could a bacterium infected with malware replicate itself?

In the hypothetical case that a piece of modified DNA has been successfully synthesized, then the malicious code could form a part of a synthetic cell capable of replicating itself autonomously in the biological realm. The malware could even be “propagated” biologically, given that bacteria inherently have all the equipment needed for reproduction. Furthermore, the malicious code would not affect the carrier cell accommodating it, but would use it to stay “alive” until its genome was sequenced in a laboratory and regained its digital form in order to then activate itself on a computer or device. However, pinpointing the correct location for this code is a complex matter if biological propagation is to succeed. Here are some of the areas where a malicious string could be inserted:

Irrelevant area: the malicious code enters an area of little importance; it is likely to have no significant impact.
Area of a gene: if it enters a gene sequence and produces a mutation, two possibilities arise. Either the mutation is lethal, in which case it may disappear from nature without propagating itself, or the mutation is beneficial or neutral, in which case the added portion may continue its propagation.
Regulatory area: in this case, it could alter a gene, as in the gene case above, or it could do nothing, as in the irrelevant-area case.

As such, in the event that it does not produce a lethal mutation, the malware and the synthetic carrier cell could form a kind of “cybernetic commensalism,” to make a simple comparison to the kind of symbiosis by which one participant obtains a benefit while the other one is neither harmed nor benefits.

In the University of Washington's research, more emphasis is placed on sequencing a piece of DNA that has no biological objective, and it is not clear [to me] whether the biological propagation scenario was dismissed on grounds of feasibility or complexity. I believe that this scenario, as much like science fiction as it sounds, could be another point to consider in the future.

Detecting malicious strings

As the information is coded into the sequence itself, detecting malicious strings could be a complicated procedure. Even if an application is capable of identifying candidate strings, establishing whether or not they belong to the legitimate structure of the sequence may be no trivial matter, especially if the DNA in question has a biological objective (and has not been published), is used to store information, or serves some other purpose.
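
For the simpler case in which a trusted, published reference sequence does exist, the check can be sketched as follows. This is a hedged illustration only: the function names read_fasta, digest, and first_difference, as well as the toy sequences, are hypothetical and do not come from the thesis or from the University of Washington paper.

```python
import hashlib


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are joined later; case is normalized here.
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def digest(sequence):
    """SHA-256 digest of a sequence, usable as an integrity fingerprint."""
    return hashlib.sha256(sequence.encode("ascii")).hexdigest()


def first_difference(a, b):
    """Index of the first position where two sequences differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Toy data (assumption): a "published" reference and a "sequenced" sample
# that contains nine extra bases.
published = read_fasta(">ref\nACGTACGT\nACGT\n")["ref"]
sequenced = read_fasta(">sample\nACGTACGT\nACAACCACGACGT\n")["sample"]

if digest(published) != digest(sequenced):
    print("sequence differs from reference at index",
          first_difference(published, sequenced))
```

A check like this only says that the sample differs from the reference. It cannot say whether the difference is a natural mutation, a sequencing error, or a deliberately inserted string, and it is of no help when no published reference exists.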

Conclusion

It is interesting to see that this topic is finally gaining more attention in the media and, possibly, among researchers and specialists, thanks to the research done by Tadayoshi Kohno and his team. Despite the debatable elegance of the implementation (creating a vulnerability in an application), one of the most important points from a security perspective is gaining ground: the need to subject this topic to greater scrutiny and spark an interdisciplinary discussion in which IT and bioinformatics specialists, security experts, equipment manufacturers, governments, and specialists in molecular and synthetic biology come together.

In my opinion, given the speed at which sequencing devices are developing and the dramatic reduction in costs, achieving security in DNA sequences will require a lot more work than can be done by one research group and a few enthusiasts. Unfortunately, until there are real-life cases or economic losses, it is likely that we will not see anything more in the media than sensationalist articles predicting the “genome-alypse.”

It is true that the feasibility is still low and there is no reason to be alarmed, but we should also remember that with IT security, waiting for an attack to happen before finding a solution has never been a good strategy.

* This is a free translation of the original version in Spanish and has been adjusted and modified after submission.

Disclaimer: Nothing presented here claims to be exhaustive, and it may contain errors, given the interdisciplinary nature of the research and the fact that my background is in computer science, not genetics. Comments, suggestions, and improvements are therefore welcome in order to keep deepening and expanding this fascinating topic.