Perl scripts for phylogenetic analyses


Character Exclusion: noLR / LeuArg1 v1.3

The problem of nucleotide compositional heterogeneity
The reconstruction of evolutionary relationships between ancient lineages using highly divergent, protein-coding DNA sequence data can be hampered by nucleotide compositional heterogeneity. Current versions of software frequently used to infer molecular phylogenies (e.g., MrBayes, RAxML, Garli, PAUP*) do not account for compositional heterogeneity across taxa. Departures from homogeneity produce a compositional signal that can conflict with, and only occasionally concurs with, the phylogenetic signal inferred from the sequence of nucleotides.

Potential workarounds
This problem has received increased attention in publications, and approaches to alleviate it range from the traditional exclusion of third codon positions to novel nucleotide models that aim to account for heterogeneous nucleotide compositions. We have developed two simple approaches termed "Degen" and "noLR" that operate at the data matrix level and permit conventional data analyses under various optimality criteria and models as implemented in any of the commonly used software packages. This is advantageous for the interpretation of results and support values inasmuch as these software packages and algorithms have been tested extensively and a great deal of experience with them has accumulated. A further key strength of these two approaches is their low computational requirements, which are reflected in much lower memory consumption and shorter calculation times compared to other, more complex approaches. This is particularly important in light of the ever-increasing data set sizes of multi-gene phylogenetics and phylogenomics.

The "noLR" approach
The "LeuArg1" Perl script (""; authors: A. Hussey & P. Donohue, modified by A. Zwick, see script) greatly reduces nucleotide compositional heterogeneity between taxa by eliminating all characters that encode synonymous changes, which are by far the largest source of compositional heterogeneity in protein-coding data sets. Synonymous changes occur largely at third codon positions and, in the case of Leucine and Arginine, also at the first codon position, but never at the second codon position. The "noLR" approach is based on the removal of entire characters: all third codon positions, and all those first codon positions that encode Leucine or Arginine in at least one sequence, are removed for all sequences. The user can raise this threshold to at least two sequences, at least three, and so on. Unlike "Degen", this approach also excludes substantial amounts of non-synonymous information at first and third codon positions.
In the current version of "LeuArg1" (v1.3), only the "standard genetic code" for nuclear, protein-coding genes in animals is implemented. The script requires a FASTA or FLAT file as input and produces text files that list all those first codon positions that do (LRall1) or do not (noLRall1) potentially encode Leucine or Arginine.
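
The selection rule described above can be illustrated with a minimal Python sketch. This is not the LeuArg1 script and makes no claim to reproduce its behaviour: the function name `split_first_positions`, the `min_sequences` parameter and the dictionary-based alignment are assumptions made for illustration. As a simplification, it only recognises unambiguous codons, whereas the script lists positions that "potentially" encode Leucine or Arginine.

```python
# Illustrative sketch only; NOT the LeuArg1 script.
# Codons of the standard genetic code that encode Leucine or Arginine.
LEU_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}
ARG_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}
LR_CODONS = LEU_CODONS | ARG_CODONS


def split_first_positions(alignment, min_sequences=1):
    """Split first codon positions into LR and non-LR columns.

    alignment: dict mapping taxon name -> aligned, in-frame nucleotide string.
    min_sequences: hypothetical threshold; a first codon position counts as
        "LR" if at least this many sequences have a Leu or Arg codon there.
    Returns (lr_positions, no_lr_positions) as 1-based alignment columns.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    length = lengths.pop()
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")

    lr_positions, no_lr_positions = [], []
    for start in range(0, length, 3):
        # Count sequences whose codon at this site encodes Leu or Arg.
        count = sum(
            1
            for seq in alignment.values()
            if seq[start:start + 3].upper().replace("U", "T") in LR_CODONS
        )
        target = lr_positions if count >= min_sequences else no_lr_positions
        target.append(start + 1)  # 1-based column of the first codon position
    return lr_positions, no_lr_positions


# Made-up toy alignment of four codons.
toy_alignment = {
    "taxonA": "ATGCTGAAACGT",
    "taxonB": "ATGTTAAAAGGT",
    "taxonC": "ATGATGAAAGGT",
}
lr_all1, no_lr_all1 = split_first_positions(toy_alignment, min_sequences=1)
```

In this toy alignment the second codon is Leucine in two taxa and the fourth is Arginine in one, so with `min_sequences=1` columns 4 and 10 are flagged, while with `min_sequences=2` only column 4 is.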

These lists of codon positions are the basis for character sets that can be included or excluded in analysis software, e.g., PAUP*. For example, to obtain noLRall1+nt2 as used in our publications (no third codon positions, no potentially Leucine or Arginine encoding first codon positions), one could create character sets for LRall1 and nt3 and then remove both from the analysis with the exclude command in PAUP*.
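
As a rough sketch of that last step, the following Python helper formats a list of LRall1 positions as NEXUS text that defines both character sets and excludes them. The helper name `nexus_exclusion_block`, its parameters and the example positions are hypothetical, and the exact block layout is an assumption that should be checked against the PAUP* documentation before use.

```python
def nexus_exclusion_block(lr_positions, n_chars):
    """Format NEXUS text defining LRall1 and nt3 and excluding both.

    lr_positions: 1-based first codon positions to exclude (hypothetical input,
        e.g. taken from an LRall1 position list).
    n_chars: total number of characters (columns) in the data matrix.
    """
    if not lr_positions:
        raise ValueError("lr_positions must not be empty")
    lr_list = " ".join(str(p) for p in lr_positions)
    lines = [
        "begin sets;",
        f"  charset LRall1 = {lr_list};",
        # Every third character starting at column 3 (third codon positions).
        f"  charset nt3 = 3-{n_chars}\\3;",
        "end;",
        "",
        "begin paup;",
        "  exclude LRall1 nt3;",
        "end;",
    ]
    return "\n".join(lines)


# Made-up example: a 12-column matrix with LR first positions at 4 and 10.
block_text = nexus_exclusion_block([4, 10], 12)
print(block_text)
```

The resulting text would be appended to the NEXUS data file, leaving the second codon positions and the remaining first codon positions (noLRall1+nt2) in the analysis.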

The actual "LeuArg1" script ("") is available for download, but there are currently no plans to make a web service available, as "Degen" often generates stronger node support given its larger number of informative characters. More background information and details can be found in the documentation and in the cited publication, as well as within the script itself.

For comments and questions, please contact Andreas Zwick at