Character Exclusion: noLR / LeuArg1 v1.3

                             LeuArg1 v1.3
           April Hussey, Andreas Zwick & Jerome C. Regier
                 University of Maryland (AH), USA
    University of Maryland Biotechnology Institute (AZ, JCR), USA
     Comments or questions about this script should be sent to
            Dr. Andreas Zwick at
README file for, version 1.3
Last updated: 5 FEB 2010
Latest version available at:
*  Please acknowledge the use of this script in your publications   *
*     by citing:                                                    *
*  Regier, J.C., Shultz, J.W., Ganley, A.R.D., Hussey, A., Shi, D., *
*  Ball, B., Zwick, A., Stajich, J.E., Cummings, M.P., Martin, J.W. *
*  & Cunningham, C.W. (2008).                                       *
*  "Resolving arthropod phylogeny: Exploring phylogenetic signal    *
*  within 41 kb of protein-coding nuclear gene sequence".           *
*  Systematic Biology 57: 920-938.                                  *
1) Introduction
2) Requirements and installation
3) Usage
4) How it works
5) Versions
6) History of development
7) License
8) Acknowledgments

1) INTRODUCTION

The reconstruction of evolutionary relationships between ancient
lineages using highly divergent, protein-coding DNA sequence data can
be hampered by nucleotide compositional heterogeneity. Current versions
of software frequently used to infer molecular phylogenies (e.g.,
MrBayes, RAxML, GARLI, PAUP*) do not account for compositional
heterogeneity across taxa, with divergences from homogeneity resulting
in signals that can conflict, but occasionally concur, with the
phylogenetic signal inferred from the sequence of nucleotides. The PERL
script "LeuArg1" aids in the phylogenetic analysis of highly divergent
DNA sequence data by greatly reducing nucleotide compositional
heterogeneity between taxa. The key observation that "LeuArg1" exploits
is that most compositional heterogeneity resides in sites that undergo
synonymous change. A standard approach to reducing this problem for
phylogenetic analysis is simply to eliminate all characters at the
third codon position (nt3), leaving characters at the other two codon
positions (nt1, nt2) intact. "LeuArg1" goes one step further by
generating a list of all nt1 characters that encode one or more leucine
or arginine residues. Leu and Arg codons are the only ones that can
undergo synonymous change at nt1. This list can then be used to
specifically remove those characters, for example, by defining a
character set in PAUP* and then excluding that character set. With this
character set and nt3 removed, what remains is a matrix that largely
undergoes nonsynonymous change. As implemented, LeuArg1_v1_3 uses only
the standard genetic code for protein-coding nuclear genes in animals.
For more background information and details, see section 4, "HOW IT
WORKS".
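
The claim that only Leu and Arg codons can undergo synonymous change at
nt1 can be checked directly against the standard genetic code. The
following Python sketch is an illustration only and is not part of the
"LeuArg1" script; the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acids_with_synonymous_nt1():
    """Return the amino acids for which a change at nt1 alone can be synonymous."""
    found = set()
    for codon, aa in GENETIC_CODE.items():
        if aa == "*":  # ignore stop codons
            continue
        for base in BASES:
            variant = base + codon[1:]  # same nt2 and nt3, different nt1
            if variant != codon and GENETIC_CODE[variant] == aa:
                found.add(aa)
    return found


print(sorted(amino_acids_with_synonymous_nt1()))  # prints ['L', 'R']
```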

2) REQUIREMENTS AND INSTALLATION

Data requirements:
        - nucleotide sequences must conform to the "standard genetic
          code" [protein-coding nuclear genes in animals]:

        - sequences must begin at the 1st codon position (nt1) and end
          at the 3rd (nt3), and consist only of complete codons
        - any indels, represented by dashes in the data matrix, must be
          triplets or their multiples and in-frame, i.e., located
          between nt3 and nt1 of adjoining codons
        - sequence data must be in non-interleaved FASTA or FLAT file
          format [sequence identifier preceded by ">" or "#" on 1st
          line followed by entire nucleotide sequences on subsequent
          lines]
        - the data file has to be a multi-sequence alignment
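
The data requirements listed above can be checked before running the
script. The following Python sketch shows one possible pre-flight check;
it is not part of "LeuArg1", the function names are hypothetical, and it
assumes that every sequence follows an identifier line starting with ">"
or "#".

```python
def read_alignment(text):
    """Parse non-interleaved FASTA/FLAT text into a {name: sequence} dict."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line[0] in ">#":  # identifier line
            name = line[1:].strip()
            seqs[name] = ""
        elif name is not None:  # sequence line belonging to the last identifier
            seqs[name] += line.upper()
    return seqs


def check_alignment(seqs):
    """Return a list of violations of the data requirements (empty if none)."""
    problems = []
    if len({len(s) for s in seqs.values()}) > 1:
        problems.append("sequences differ in length")
    for name, seq in seqs.items():
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length is not a multiple of 3")
        # Inspect complete codons only; a gap must fill the whole codon.
        for start in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[start:start + 3]
            if "-" in codon and codon != "---":
                problems.append(
                    f"{name}: out-of-frame gap in codon at character {start + 1}"
                )
    return problems
```

An empty list means that the alignment consists of complete codons of
equal length with in-frame gaps only.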
System requirements:
        - any operating system (e.g., Linux/UNIX, Mac OS-X, Windows)
          with a functional installation of a PERL language interpreter
          (e.g., http:/; type "perl -V"
          in a shell to check for a PERL installation
There is no need to install the script other than to copy it to a
directory of your choice. To use the script directly from any
directory, place the script in a location that is included in your PATH
variable or adjust the PATH variable accordingly.

3) USAGE

The script expects only a single command-line parameter, namely, the
data input file:
It is best called with your PERL interpreter, but the actual command
might vary depending on your operating system. A ".fasta" or ".flat"
suffix is not needed for the script to operate correctly.
An example of how to use the script correctly under Linux:
                perl ./ mydata.fasta
The output consists of multiple files [named "LRx", "LRallx", "noLRx"
and "noLRallx", where x is a positive integer], all in a single folder.
The output does not overwrite the original input file.
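
When the script is driven from another program, the call described
above can be wrapped as in the following Python sketch. The path to the
script is passed in as a parameter because its file name depends on the
local installation. The output directory is derived by appending "_Data"
to the input file name as given, which is an assumption about how the
file suffix is handled. All function names are hypothetical.

```python
import shutil
import subprocess


def leuarg_command(script_path, input_path):
    """Build the command line: interpreter, script, single input file."""
    return ["perl", script_path, input_path]


def expected_output_dir(input_path):
    """Directory name assumed to be the input file name plus '_Data'."""
    return input_path + "_Data"


def run_leuarg(script_path, input_path):
    """Run the script and return the directory expected to hold the output."""
    if shutil.which("perl") is None:
        raise RuntimeError("no perl interpreter found on PATH")
    subprocess.run(leuarg_command(script_path, input_path), check=True)
    return expected_output_dir(input_path)
```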

4) HOW IT WORKS

The "LeuArg1" script reads individual DNA sequences as strings of
codons, in which there are three sequential nucleotides per codon (nt1
nt2 nt3). Counting from the 5' end of the sequence, the script
identifies every codon that encodes one or more leucine or arginine
residues and places their nt1 character numbers in a text file. In the
so-called "LR1" file are identified those nt1 characters that encode
leucine or arginine residues for one or more taxa. In the "LR2" file
are identified those nt1 characters that encode leucine and/or arginine
residues for two or more taxa, and so on. The number of "LRx" files is
equal to the maximum number of taxa with leucine and/or arginine codons
in any character in the entire data matrix. (The theoretical maximum
would be equal to the number of taxa.) Also placed in separate files
and called "noLRx" are the nt1 characters that complement each of the
"LRx" files. For example, "noLR1" identifies those nt1 characters that
encode no (<1) leucine or arginine residues, i.e., nt1 minus "LR1".
"noLR2" includes those nt1 characters that encode no more than one (<2)
leucine and/or arginine residues, i.e., nt1 minus "LR2". "noLR3"
includes those nt1 characters that encode no more than two (<3) leucine
and/or arginine residues, and so on.

A slight and typically insignificant modification of the above scheme
concerns the treatment of sites that are intra-specifically polymorphic
or ambiguous for leucine and arginine. In the "LRx" files, only nt1
characters that unambiguously encode leucine or arginine are included.
Thus, for leucine, CT[anything], TTR and YT[A, G, R] are included.
Similarly, for arginine, CG[anything], AGR and MG[A, G, R] are
included. By default, characters not placed in a "LRx" file are placed
in the corresponding "noLRx" file.

In an alternative
partitioning scheme, polymorphic nt1 characters that encode leucine +
phenylalanine and/or arginine + serine are included in the so-called
"LRallx" files, in addition to those also found in "LRx". Specifically,
nt1 characters from the following polymorphic codons are listed in
"LRallx" files: CT[anything], TT[anything except Y], YT[anything],
CG[anything], AG[anything except Y] and MG[anything]. Any characters
not identified in "LRallx" are present in "noLRallx". The "LeuArg1"
script generates all of these files in a single directory that is named
by appending "_Data" to the input filename.

Although we have experimented in our phylogenetic analyses with
including increasing numbers of leucine and/or arginine residues (e.g.,
LR2, LRall2, etc.), we now restrict our analyses to characters lacking
leucine and arginine. The files generated by the "LeuArg1" script
contain PAUP*-compatible character sets ("charsets") that allow us to
conveniently
define and "exclude LRall1" (or, alternatively and equivalently:
"include noLRall1 /only") from our data matrices. With "LRall1" and nt3
excluded, we are left with a so-called "noLRall1 + nt2" character set.
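
As an illustration of the exclusion step, the following Python sketch
formats a list of character numbers as a charset and builds the
corresponding PAUP* commands. The exact layout of the files written by
"LeuArg1" is not reproduced here; the function names and the example
character numbers are hypothetical, and the "nt3" charset uses the usual
NEXUS notation for every third character.

```python
def charset_line(name, positions):
    """Format 1-based character numbers as a NEXUS/PAUP* charset definition."""
    numbers = " ".join(str(p) for p in positions)
    return f"charset {name} = {numbers};"


def exclusion_block(lr_all_1):
    """Define LRall1 and nt3, then exclude both from the analysis."""
    lines = [
        "begin sets;",
        "  " + charset_line("LRall1", lr_all_1),
        "  charset nt3 = 3-.\\3;",  # every third character, starting at 3
        "end;",
        "begin paup;",
        "  exclude LRall1 nt3;",
        "end;",
    ]
    return "\n".join(lines)


print(exclusion_block([1, 7, 22]))
```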
In pairwise comparisons of sequences in a "noLRall1 + nt2" data matrix,
no synonymous change can be directly inferred. It should be noted that
in excluding nt3 and "LRall1", some nonsynonymous change is also
eliminated, unlike the situation with "Degen" coding.
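
The partitioning into "LRx"/"noLRx" and "LRallx"/"noLRallx" described in
this section can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the authors' PERL code: the
codon patterns are taken from the text above, "anything except Y" is
read as "nt3 is not T, C or Y", the input is assumed to satisfy the data
requirements, and the refined treatment of gaps and missing data in
version 1.3 is not reproduced. All names are hypothetical.

```python
# nt1+nt2 pairs from the text; nt3 decides the remaining cases.
STRICT_ANY_NT3 = {"CT", "CG"}                # CT[anything], CG[anything]
STRICT_PURINE_NT3 = {"TT", "YT", "AG", "MG"}  # nt3 must be A, G or R
ALL_ANY_NT3 = {"CT", "YT", "CG", "MG"}        # LRall: any nt3
ALL_NOT_PYRIMIDINE_NT3 = {"TT", "AG"}         # LRall: nt3 not T, C or Y


def is_leu_arg(codon, scheme="strict"):
    """True if the codon counts as Leu/Arg under 'strict' (LRx) or 'all' (LRallx)."""
    pair, nt3 = codon[:2], codon[2]
    if scheme == "strict":
        return pair in STRICT_ANY_NT3 or (
            pair in STRICT_PURINE_NT3 and nt3 in "AGR"
        )
    return pair in ALL_ANY_NT3 or (
        pair in ALL_NOT_PYRIMIDINE_NT3 and nt3 not in "TCY"
    )


def lr_charsets(seqs, scheme="strict"):
    """Return ({x: LRx}, {x: noLRx}) as lists of 1-based nt1 character numbers."""
    length = len(next(iter(seqs.values())))
    counts = {}  # nt1 character number -> number of taxa with a Leu/Arg codon
    for start in range(0, length, 3):
        counts[start + 1] = sum(
            is_leu_arg(seq[start:start + 3].upper(), scheme)
            for seq in seqs.values()
        )
    lr, no_lr = {}, {}
    # One pair of sets per threshold x, up to the largest count observed.
    for x in range(1, max(counts.values(), default=0) + 1):
        lr[x] = [pos for pos, n in counts.items() if n >= x]
        no_lr[x] = [pos for pos, n in counts.items() if n < x]
    return lr, no_lr


alignment = {"t1": "CTGAAATTY", "t2": "TTAAAAYTT", "t3": "ATGAAATTC"}
print(lr_charsets(alignment, "strict"))
print(lr_charsets(alignment, "all"))
```

In this toy alignment the first codon is a Leu codon in two taxa under
both schemes, whereas the third codon counts only under the "all"
scheme, because YTT is polymorphic for leucine and phenylalanine.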

5) VERSIONS

The latest version of the "LeuArg1" script and this README file can be
downloaded at
Version 1.1: 03 FEBRUARY 2006
Version 1.2: 14 MAY 2009
Version 1.3: 05 FEBRUARY 2010

6) HISTORY OF DEVELOPMENT

In 2001, a manual version of what the LeuArg1_v1_3 script does was
applied to a phylogenetic data set by Jerome Regier [see Regier, J.C. &
Shultz, J.W. (2001). A phylogenetic analysis of Myriapoda (Arthropoda)
using two nuclear protein-encoding genes. Zoological Journal of the
Linnean Society 132: 469-486.]. In 2006, April Hussey, then an
undergraduate researcher in Regier's lab, and Jerome Regier discussed
the feasibility of automating this manually laborious process. April
Hussey and Paul Donohue then wrote the first script, as well as version
1.2. It was first published in: Regier, J.C., Shultz, J.W., Ganley,
A.R.D., Hussey, A., Shi, D., Ball, B., Zwick, A., Stajich, J.E.,
Cummings, M.P., Martin, J.W. & Cunningham, C.W. (2008). "Resolving
arthropod phylogeny: Exploring phylogenetic signal within 41 kb of
protein-coding nuclear gene sequence". Systematic Biology 57: 920-938.
[doi:10.1080/10635150802570791] In 2010, Andreas Zwick created a
modified version 1.3 that treats alignment gaps and missing data more

7) LICENSE

This program is free software: You can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version. This program is distributed in the hope that
it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
the GNU General Public License for more details.

8) ACKNOWLEDGMENTS

Development of the "LeuArg1" scripts (through February, 2010) was
funded by grants from the U.S. National Science Foundation
(Biocomplexity in the Environment: Genome-Enabled Environmental Science
and Engineering program, Award no. DEB-0120635; Assembling the Tree of
Life program, Award no. 0531626).