Perl scripts for phylogenetic analyses


Degeneracy Coding: Degen v1.4

The problem of nucleotide compositional heterogeneity
Nucleotide compositional heterogeneity can hamper the reconstruction of evolutionary relationships between ancient lineages from highly divergent, protein-coding DNA sequence data. Current versions of software frequently used to infer molecular phylogenies (e.g., MrBayes, RAxML, Garli, PAUP*) do not account for compositional heterogeneity across taxa. Departures from homogeneity produce signals that can conflict, but occasionally concur, with the phylogenetic signal inferred from the sequence of nucleotides.

Potential workarounds
This problem has received increased attention in publications, and approaches to alleviate it range from the traditional exclusion of third codon positions to novel nucleotide models that aim to account for heterogeneous nucleotide compositions. We have developed two simple approaches, termed "Degen" and "noLR", that operate at the data matrix level and permit conventional data analyses under various optimality criteria and models as implemented in any of the commonly used software packages. This is advantageous for the interpretation of results and support values inasmuch as these software packages and algorithms have been tested extensively and a great deal of experience with them has accumulated. A further key strength of these two approaches is their low computational requirements, reflected in much lower memory consumption and shorter calculation times compared with other, more complex approaches. This is particularly important in light of the ever-increasing data set sizes of multi-gene phylogenetics and phylogenomics.

The "Degen" approach
The "Degen" Perl script (""; authors: A. Zwick & A. Hussey, see script) greatly reduces the analytical problems introduced by nucleotide compositional heterogeneity between taxa. "Degen" operates by degenerating nucleotides to IUPAC ambiguity codes at all sites that can potentially undergo synonymous change in any and all pairwise comparisons of sequences in the data matrix. This makes synonymous change largely invisible and reduces the effect of compositional heterogeneity, while leaving the inference of non-synonymous change largely intact.
The current version of "Degen" (v1.4) implements the "standard genetic code" (default) and other genetic codes for protein-coding genes.
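
To make the idea of degeneracy coding concrete, the following is a minimal Python sketch. It is not the "Degen" Perl script and may differ from it in detail. It assumes a simplified rule: each codon is replaced, position by position, with the IUPAC code covering every nucleotide found at that position among all codons for the same amino acid under the standard genetic code. Stop codons and codons containing gaps or ambiguity codes are left unchanged. How the actual script treats particular amino acids (for example those whose codons fall into two separate families) is not stated here, so this per-position union is an assumption. The names degen_codon and degen_sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC nucleotide ambiguity codes, keyed by the set of bases each code represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def degen_codon(codon):
    """Degenerate one codon (simplified rule, an assumption of this sketch)."""
    codon = codon.upper()
    aa = CODON_TABLE.get(codon)
    if aa is None or aa == "*":
        # Gaps, ambiguity codes, incomplete codons and stop codons stay as they are.
        return codon
    synonyms = [c for c, a in CODON_TABLE.items() if a == aa]
    # For each codon position, collect the bases seen across all synonymous codons.
    return "".join(IUPAC[frozenset(c[i] for c in synonyms)] for i in range(3))


def degen_sequence(seq):
    """Apply degen_codon to an in-frame coding sequence, codon by codon."""
    return "".join(degen_codon(seq[i:i + 3]) for i in range(0, len(seq), 3))


if __name__ == "__main__":
    # Met, Phe, Leu, Arg, Ile, Trp
    print(degen_sequence("ATGTTTCTGCGTATTTGG"))  # ATGTTYYTNMGNATHTGG
```

Under this rule, codons for amino acids with a single codon (Met, Trp) are unchanged, while every synonymous alternative for the other amino acids collapses to the same degenerate codon, so two sequences that differ only synonymously become identical in the recoded matrix.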

The actual script ("") is available for download and is also offered as a web service for more convenient testing of the degeneracy coding approach. For more background information and details, see the documentation. For comments and questions, please contact Andreas Zwick at