corneliabl: Bacteria Phylogeny: Facing Up to the Problems

There are millions of species of bacteria. Sorting out their evolutionary history has been a major challenge for decades. Unlike the much bigger, multicellular eukaryotes, bacteria have few morphological markers to assist scientists in classifying them. The fossil record is mostly silent.

Molecular evolution came to the rescue thirty years ago when cloning and sequencing became common. Soon there were elaborate and detailed phylogenetic trees based on comparing sequences of conserved genes from many species.

The gene of choice was the one for the small subunit ribosomal RNA (SSU rRNA). This gene is well conserved in bacteria and it was easy to get sequences simply by PCR. (The ends of the SSU rRNA gene are conserved, which means that you can design universal primers for PCR.)

Over the years, the SSU rRNA gene has become what is called the "gold standard" in bacterial phylogeny and taxonomy. Many species have been assigned to taxa based entirely on the sequence of their SSU rRNA gene. Unfortunately, the "gold standard" has become somewhat tarnished lately.

Our fellow blogger, Jonathan Eisen of The Tree of Life, has recently published a paper that looks at the problems with bacterial phylogeny (Wu and Eisen, 2008). He posted a brief summary of the paper and commented on why he likes the journal Genome Biology [Happy Open Access Day: Back to Genome Biology for Me].

There is much to like about this paper. The authors face up to the problems with the current bacterial phylogeny, which is based almost entirely on a single gene (SSU rRNA). They point out that this is risky given what we know about molecular phylogenies. Furthermore, in the case of the SSU ribosomal RNA gene we know for a fact that this has led to problems and inconsistencies. In addition to the practical difficulties there are good theoretical reasons for being suspicious of phylogenies constructed from nucleotide sequences.

What to do? One possible solution is to abandon SSU rRNA as a "gold standard" and replace it with a highly conserved protein-coding gene. Unfortunately, this doesn't get around the problem of relying on a single gene. The way around this is to use an artificial concatenated sequence made up of several different conserved genes laid out end-to-end in one large string of amino acids.
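To make the idea of a concatenated sequence concrete, here is a minimal Python sketch. It is not taken from the paper: the function name, the toy sequences, and the choice to pad a missing gene with gaps are all my own assumptions for illustration.

```python
from typing import Dict, List


def concatenate_alignments(gene_alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-gene protein alignments end to end, one row per species.

    Each input maps species name -> aligned sequence for one gene.
    A species missing from a gene is padded with gaps (an assumption of
    this sketch) so that every row in the result has the same length.
    """
    species = sorted({name for aln in gene_alignments for name in aln})
    rows: Dict[str, List[str]] = {name: [] for name in species}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        width = lengths.pop()
        for name in species:
            rows[name].append(aln.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in rows.items()}


# Toy, made-up aligned fragments for two genes (not real sequences).
rpoB = {"E_coli": "MVYSY", "B_subtilis": "MLT-Y"}
tsf = {"E_coli": "AEIT", "B_subtilis": "AQIT", "T_maritima": "AKIS"}

supermatrix = concatenate_alignments([rpoB, tsf])
for name, seq in supermatrix.items():
    print(name, seq)
```

Each gene still has to be aligned on its own before this step. The concatenation only stacks the finished alignments side by side.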

So why isn't this done? Because, as Wu and Eisen point out, it ain't that easy. The main difficulty in any phylogenetic study is getting a proper alignment. This is a problem that many workers simply ignore when they use automated alignment software like CLUSTALW. These workers assume that the alignments are valid.

They aren't, and this is another example of facing up to the problem. Many scientists agonize over what program to use when constructing their trees: should they use maximum likelihood, parsimony, or something else? In most cases these decisions are a complete waste of time because their alignments aren't good enough to make a difference.

Here's how Wu and Eisen explain it ...

"It has been shown that alignment quality can have greater impact on the final tree than does the tree-building method employed [20]. Therefore, preparing high quality sequence alignments is a most critical part of any molecular phylogenetic analysis. This preparation typically involves careful but tedious manual editing and trimming of the generated alignments, and thus remains the biggest challenge to automation. When scaling up this process, the trimming step is often simply ignored. Automated trimming based on the number of gaps in each column or each column's conservation score can be used to select conserved blocks, but still is not satisfactory when a high quality tree is required."
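For readers who haven't seen it, the gap-based automated trimming mentioned in that passage amounts to something like the following Python sketch. The 50% threshold, the function name, and the toy alignment are my own assumptions. This is the simple approach the authors describe as unsatisfactory for high quality trees, not what AMPHORA does.

```python
from typing import List


def trim_gappy_columns(alignment: List[str], max_gap_fraction: float = 0.5) -> List[str]:
    """Drop every alignment column whose fraction of gaps exceeds the threshold."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    n_rows = len(alignment)
    keep = [
        col
        for col in range(width)
        if sum(row[col] == "-" for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


# Toy alignment (made up): three of the eight columns are mostly gaps.
aligned = [
    "MK-LV--A",
    "MKTLV--A",
    "MR-LI-GA",
    "MK-LVS-A",
]

trimmed = trim_gappy_columns(aligned)
for row in trimmed:
    print(row)
```

A rule like this knows nothing about whether the surviving columns are actually aligned correctly, which is why careful workers still end up editing by hand.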

Keep in mind that what is being proposed is a large tree based on concatenated sequences from many genes. You don't want to do multiple sequence alignments for every gene by hand, and yet up until now, that was the only way to get accurate results.

Wu and Eisen have written a program called AMPHORA that hopefully solves this problem. They begin by creating manually curated "seed alignments." Then they use AMPHORA to align all the other sequences to the seed alignments. In this way they hope to overcome the limitations of automated multisequence alignment without having to align everything by hand.

None of this would be possible, of course, unless there were large numbers of species where every one of the target genes has been cloned and sequenced. In the 20th century this would have been impossible but now there are hundreds of completely sequenced bacterial genomes. This means that each one of them has a sequenced copy of the genes required for this kind of analysis.

All that's left is to identify the completely sequenced genomes and pick the set of genes. There are 578 genomes in the database but many of these are close relatives that will not be useful in constructing a large tree of all bacterial sequences. The final set contains 310 genomes with representatives of all the major groups.

The authors selected 31 genes for their initial proof of principle paper (dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf). Those of you who recognize these genes will see that 22 of them (the rpl, rpm, and rps genes) encode small ribosomal proteins. This was not the best choice, in my opinion, but the authors of the paper note that they are continuing the study by incorporating better genes such as HSP70 (dnaK) and EF-Tu (tufA). You can't just choose any conserved gene because it has to be present in most species and there are surprisingly few genes that meet that criterion.
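The "present in most species" criterion is easy to state in code. The sketch below is only an illustration under my own assumptions: the genome names and gene contents are invented, the 75% cutoff is arbitrary, and it is not the selection procedure used in the paper.

```python
from typing import Dict, List, Set


def widely_present_genes(genomes: Dict[str, Set[str]], min_fraction: float = 0.9) -> List[str]:
    """Return genes found in at least min_fraction of the genomes, sorted by name."""
    if not genomes:
        return []
    counts: Dict[str, int] = {}
    for genes in genomes.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    needed = min_fraction * len(genomes)
    return sorted(gene for gene, n in counts.items() if n >= needed)


# Hypothetical gene inventories for four made-up genomes.
toy_genomes = {
    "genome_A": {"rpoB", "tsf", "pgk", "nifH"},
    "genome_B": {"rpoB", "tsf", "pgk"},
    "genome_C": {"rpoB", "tsf"},
    "genome_D": {"rpoB", "tsf", "pgk", "nifH"},
}

candidates = widely_present_genes(toy_genomes, min_fraction=0.75)
print(candidates)
```

Tighten the cutoff toward 100% and the list shrinks quickly, which is the point of the last sentence above.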

After all that, what's the bottom line? The grand phylogeny is shown at the top of this posting. It resolves many groups that are unresolvable using the SSU rRNA tree. In some cases this tree reveals species that have been incorrectly assigned to higher taxa. These species will have to be reclassified if this result holds up.

The most important finding is that the method works and it yields trees with excellent resolution of the major bacterial taxa.

Wu, Martin, and Eisen, Jonathan (2008). A simple, fast, and accurate method of phylogenomic inference. Genome Biology, 9:R151. [doi:10.1186/gb-2008-9-10-r151]

corneliabl

Tuesday, October 14, 2008
