Thursday, September 5, 2013

The Importance of Sequence Alignments

There are several required steps in constructing phylogenetic trees from sequence data. The first step is to align the sequences so you can make direct comparisons. It used to be that multiple sequence alignments had to be checked manually, because no available computer program was as good as an experienced scientist. That hasn't changed. What has changed is that the data sets have become so large and complicated that nobody wants to even look at the sequence alignments to see whether they can be improved.

Drew et al. (2013) suggest that sequence alignments should be made available:
"Until recently, uploading sequences to GenBank (or EMBL) was generally considered sufficient to ensure reproducibility of phylogenetic studies using DNA sequence data. Increasingly, however, the systematics community is realizing that archiving raw DNA sequences is not adequate, and that the underlying alignments of DNA sequences as well as the resulting phylogenetic trees are pivotal for reproducibility, comparative purposes, meta-analyses, and ultimately synthesis. Indeed, there has been a growing clamor for journals to adopt and enforce more rigorous data archiving practices across diverse disciplines [4]–[8]. As a result, about 35 evolutionary journals [5],[9] have adopted policies to encourage or require authors to upload alignments, phylogenetic trees, and other files requisite for study reproducibility [5] to TreeBASE (http://treebase.org/) and/or other public repositories such as Dryad (http://datadryad.org). Unfortunately, enforcement of such data deposition policies is generally lax, and most journals in systematics and evolution still do not require DNA sequence alignment or tree deposition. As a result, the alignments and trees underlying most published papers in systematics/phylogenetics and evolutionary biology remain inaccessible to the scientific community at large [8],[10]."
I sympathize with the goal but I doubt that it can be achieved. I strongly suspect that many scientists don't even bother to produce sequence alignments. They just feed the electronic data directly into their tree-making algorithm.

I wonder how many anomalies could be resolved if they just looked at the alignments? Would they even know if bad sequence data was being used for one or two species in their alignment?
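Even a crude automated check can point to the rows of an alignment that deserve a look by eye. The sketch below is only an illustration, not a method from Drew et al. or from any alignment package: it assumes the alignment is already loaded as a dictionary of equal-length, gapped strings, and it flags any sequence whose mean pairwise identity to the others falls well below the median. The function names, the toy sequences, and the 0.2 tolerance are all hypothetical choices made for the example.

```python
from statistics import median


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def flag_suspect_sequences(alignment, tolerance=0.2):
    """Return names whose mean identity to all other sequences is more than
    `tolerance` below the median mean identity (assumes at least two sequences)."""
    names = list(alignment)
    mean_identity = {}
    for name in names:
        scores = [
            pairwise_identity(alignment[name], alignment[other])
            for other in names
            if other != name
        ]
        mean_identity[name] = sum(scores) / len(scores)
    cutoff = median(mean_identity.values()) - tolerance
    return sorted(name for name, value in mean_identity.items() if value < cutoff)


# Toy alignment (made-up sequences): species_E stands in for a bad or
# misaligned entry that has little in common with the other rows.
alignment = {
    "species_A": "ATGGCCATTGTAATGGGCCGC",
    "species_B": "ATGGCCATTGTGATGGGCCGC",
    "species_C": "ATGGCTATTGTAATGGGCCGC",
    "species_D": "ATGG-CATTGTAATGGGCCGC",
    "species_E": "TTCAGGCAACTGCCTAAGTTA",
}

print(flag_suspect_sequences(alignment))
```

A check like this does not replace an experienced eye. It only narrows down which sequences to inspect first, and a genuinely divergent species would be flagged just as readily as a bad sequence.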


Drew, B.T., Gazis, R., Cabezas, P., Swithers, K.S., Deng, J., Rodriguez, R., Katz, L.A., Crandall, K.A., Hibbett, D.S., and Soltis, D.E. (2013) Lost Branches on the Tree of Life. PLoS Biol 11(9): e1001636. [doi: 10.1371/journal.pbio.1001636]
