Decoding the Human Genome by Computer-aided dN/dS Analysis

by Andrew Brockman, Wenping Lyu

PREFACE: The first Human Genome was sequenced and assembled in 2003. This was a tremendous breakthrough. Or was it? If your answer to that question is either “No” or “I don’t know”, then hopefully we can change your mind by decoding the human genome.

Central to the discipline of evolutionary biology is the detection of protein-coding genes undergoing “adaptive evolution”. The dN/dS ratio (a.k.a. Ka/Ks) is one of the most widely used metrics in statistical tests of the type of natural selective pressure a gene has experienced over some period of evolutionary time. dN/dS analysis promises to unlock the hidden evolutionary stories that Nature has forged into an organism’s DNA sequence. Why is it so widely used? We attempt to answer this in four parts: 1) The Human Genome, a Sealed Book. 2) What is the dN/dS Ratio? 3) Why is it Useful to Society? 4) How can Supercomputers help?

The Human Genome, a Sealed Book

If we play with the analogy that evolution is a kind of “story”, then genomic DNA molecules (gDNA) are the ink and paper in which (much of) this story is written. gDNA molecules are like “books” embedded with many genes, stitched in tandem. mRNA molecules are like transcribed “messages” from within the gDNA book: each mRNA is a copy-pasted version of a single gene. mRNA molecules instruct the cell to make protein molecules (Fig. 1), which are like blocks for building bigger “machines”: tissues, organs, etc.

Figure 1. mRNA, transcribed from gDNA, instructs the cell to make protein.

 

Biomolecules work cooperatively, in the collective interest of their survival and continued inheritance (see: Dawkins). So, what exactly is inherited? The answer is gDNA. The whole DNA alphabet consists of only four letters: A, G, C and T, shorthand for the nucleobases Adenine, Guanine, Cytosine and Thymine. The RNA alphabet differs slightly: instead of T, it has U (Uracil). Surely this is too simple? Something as complex as the Human brain, made from the instructions of just four letters?! Every Human cell contains some 3 billion pairs of these letters. Meaning, each and every cell in your body stores the latest surviving edition of a “DNA book”, whose original copy was written 3.8 bn years ago... Back on topic: type ATGCTATATAATA into Google Translate (Fig. 2). Google detected Maori, but that’s not very meaningful.

KEY: The dN/dS ratio can extract meaning from DNA sequences.

Figure 2. The Google translation of the sequence “ATGCTATATAATA”.
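Google Translate aside, a cell does have a dictionary for these letters: the genetic code, which reads mRNA three letters (one codon) at a time. Below is a minimal sketch of that reading step, assuming the standard genetic code and assuming our example sequence is the coding strand of a gene. The function names `transcribe` and `translate` are our own, chosen for illustration.

```python
# Standard genetic code, written compactly: amino acids listed for codons
# in TCAG order (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: the same letters, with U in place of T."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read an mRNA one codon at a time, stopping at the first stop codon.
    Leftover letters that do not fill a whole codon are ignored."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGCTATATAATA")
print(mrna)             # AUGCUAUAUAAUA
print(translate(mrna))  # MLYN (Met-Leu-Tyr-Asn; the final A is left over)
```

So the cell would read our example as four amino acids rather than as Maori. What the genetic code alone cannot tell us is which changes to such a sequence mattered during evolution.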

What is the dN/dS ratio?

DEFINITION: dN/dS analysis converts a pair of protein-coding DNA sequences into an estimate of the number of mutations that change the encoded protein, and so may affect its function (nonsynonymous, N), vs. the number of mutations that do not (synonymous, S), over some period of divergence (Fig. 3).

Output (e.g. NG)   | Natural Selective Pressure (on gene) | Dominant Evolutionary Dynamic
-------------------+--------------------------------------+------------------------------
dN/dS ratio ≈ 1    | Neutral                              | Genetic Drift
dN/dS ratio > 1    | Positive                             | Adaptive Evolution
dN/dS ratio < 1    | Negative                             | Evolutionary Conservatism

Figure 3. Diagrammatic drawing of dN/dS analysis.
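To make the definition concrete, here is a minimal, unoptimised sketch of a Nei-Gojobori-style (NG) calculation for one pair of sequences. It is an illustration written for this article, not our optimised implementation, and it rests on several assumptions: the two sequences are already aligned, have no gaps and no stop codons, use the standard genetic code, and changes that would create a stop codon are simply counted as nonsynonymous when counting sites. All function names are our own.

```python
import math
from itertools import permutations

BASES = "TCAG"
# Standard genetic code, amino acids in TCAG codon order; "*" is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def syn_sites(codon):
    """Synonymous sites in one codon: for each position, the fraction of the
    three possible base changes that leave the amino acid unchanged."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODE[mutant] == CODE[codon]:
                s += 1 / 3
    return s


def count_diffs(c1, c2):
    """(synonymous, nonsynonymous) differences between two codons, averaged
    over every order in which the differing positions could have mutated."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    if not positions:
        return 0.0, 0.0
    paths = []
    for order in permutations(positions):
        cur, sd, nd = c1, 0, 0
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == "*":      # ignore pathways through a stop codon
                break
            if CODE[nxt] == CODE[cur]:
                sd += 1
            else:
                nd += 1
            cur = nxt
        else:                         # runs only if the pathway was not broken
            paths.append((sd, nd))
    if not paths:
        return 0.0, 0.0
    return (sum(p[0] for p in paths) / len(paths),
            sum(p[1] for p in paths) / len(paths))


def jukes_cantor(p):
    """Correct a proportion of differences for repeated hits at one site."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(seq1, seq2):
    """Return (dN, dS, dN/dS); the ratio is None when dS is zero."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("expected equal-length sequences of whole codons")
    pairs = [(seq1[i:i + 3], seq2[i:i + 3]) for i in range(0, len(seq1), 3)]
    S = sum((syn_sites(a) + syn_sites(b)) / 2 for a, b in pairs)
    N = 3 * len(pairs) - S
    Sd = sum(count_diffs(a, b)[0] for a, b in pairs)
    Nd = sum(count_diffs(a, b)[1] for a, b in pairs)
    dS = jukes_cantor(Sd / S if S else 0.0)
    dN = jukes_cantor(Nd / N if N else 0.0)
    return dN, dS, (dN / dS if dS > 0 else None)


# Toy, made-up sequences (not real genes): one synonymous change (GGG -> GGA)
# and one nonsynonymous change (TTT -> TTA).
seq1 = "ATG" + "GGG" * 4 + "TTT"
seq2 = "ATG" + "GGA" + "GGG" * 3 + "TTA"
dN, dS, ratio = dn_ds(seq1, seq2)
print(round(dN, 3), round(dS, 3), round(ratio, 2))
```

For this toy pair the ratio comes out at roughly 0.3, which falls in the “Negative” row of the table. Six codons are far too few to conclude anything: real analyses use whole genes and a statistical test before calling a ratio different from 1.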

Why is dN/dS analysis useful?

Scientific inquiry begins with a question. Hypotheses can then be formulated as possible answers, which in turn are falsified through experimentation and statistical testing (see: Karl Popper). dN/dS analysis uses statistical rigour to falsify hypotheses about the evolutionary history of DNA sequences. We personally find these questions fascinating:

  • Life: what causes speciation? (find out) Why are these extinct? Should we even worry? (yes)
  • Sex: some species reproduce sexually, others asexually, why? (find out) What is the purpose of the peacock’s tail? (find out)
  • Aging: why do we age? (find out) Do bacteria age? (find out) Plants? (find out) Archaea? (find out)
  • Humans: why do cancer cells divide without limit, yet normal body cells do not? (find out)
  • Mammals: all have mammary glands, yet only the platypus and echidnas lay eggs, why? (find out)
  • Arthropods: why do insects metamorphose? (find out) Why are honeycombs hexagonal? (find out)
  • Parasites: malaria is spread by mosquitoes, so why can’t HIV be? (find out) And why can Toxoplasma infect almost any warm-blooded host? (find out)
  • Fungi: gave us antibiotics (Penicillium), biological weapons (Anthrax), plastics of the future: Mycelium, and this one looks like Buckminsterfullerene (Clathrus): why, why, and why?

Not all these questions have clear answers, and perhaps a few do not even make sense. Thinking about evolution as the “story of life” allows us to explain things: to make sense of living phenomena through principles such as genetic inheritance, mutation, natural selection and drift. For the curious mind, there’s a certain pleasure in finding things out from the evolutionary perspective.

Industry uses it, medics use it, agrarians use it, and conservation ecologists use it: to develop long-lasting vaccines against pathogens, to monitor the spread of resistance genes, and to keep the evolutionary “stories” of endangered species from ending. Pathogens, parasites and pests are in a constant evolutionary arms race with pharmaceutical companies, insecticide makers and agricultural pesticide manufacturers, to name a few. You cannot swallow a dN/dS ratio as you would an antibiotic pill; it doesn’t have that kind of tangibility. But tools like dN/dS analysis can, and do, save lives, crops and species. And unlike antibiotics, theory, software and hardware improve over time, so it is much harder for evolution to adapt to such tools.

How can supercomputers help?

The Human Genome Project was completed in 2003. It took mankind 3.8 bn years, and $3.8bn, to sequence one Human genome. Twelve years later, in 2015, one of us was given the great privilege of working with an incredible team on a sequencing effort that led to the publication of 16 new Anopheline genomes. Now, millions of human genomes have been sequenced (Fig. 4a). Decoding these millions of sealed books by dN/dS analysis requires enormous computational effort.
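A back-of-the-envelope sketch shows why the effort grows so quickly. dN/dS analysis works on pairs of sequences, so if we assume (purely for illustration) that every sequence of a gene is compared against every other, the number of comparisons per gene grows with the square of the number of genomes. Real pipelines usually compare far fewer, carefully chosen pairs; the helper name below is our own.

```python
import math


def pairwise_comparisons(n_sequences):
    """Number of distinct pairs among n sequences: n * (n - 1) / 2."""
    return math.comb(n_sequences, 2)


# 16 genomes, a thousand genomes, a million genomes: pairs per gene.
for n in (16, 1_000, 1_000_000):
    print(f"{n:>9,} sequences -> {pairwise_comparisons(n):,} pairs")
```

Sixteen sequences give 120 pairs per gene; a million give about half a trillion, before multiplying by the number of genes in each genome.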

Not too long ago we didn’t even have the luxury of a calculator, let alone a computer. A scientist’s tools were pencil, paper, chalk, blackboards and education. In school, we learnt how to convert inputs (e.g. 2 x 2) into an output (e.g. 4), step by step. Put the steps together, et voilà: an algorithm. Recently, we optimised the Nei-Gojobori algorithm for calculating dN/dS ratios quickly (GitHub). But fast software is not enough. We need both fast software and fast computers (Fig. 4b) to hear the stories hidden in Life’s code, and save lives along the way.


Figure 4. Growth over time of (a) the cumulative number of sequenced human genomes and (b) the performance of modern supercomputers.