`ProtTest` is a bioinformatic tool for the selection of best-fit models of amino acid replacement for the data at hand. It uses PhyML (Guindon and Gascuel, 2003) to calculate the maximum likelihood (ML) estimates of phylogenetic trees and model parameters. Models included in ProtTest are empirical models of amino acids substitution. After obtaining the ML estimates and loglikelihood, it compares the models and find the best model according to Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) scores or Decision Theory Criterion (DT). ProtTest also provides other information about parameters and trees. I’ll show how to use ProtTest in command line in the following.

For command line execution of ProtTest, these are the parameters that can be set by user (copied from README in the package):

-i alignment_filename

alignment input file (required)

-t tree_filename

tree file (optional) [default: NJ tree]

-o output_filename

output file (optional) [default: standard output]

-S optimization_strategy

optimization strategy mode: [default: 0]

0: Fixed BIONJ JTT

1: BIONJ Tree

2: Maximum Likelihood tree

3: User defined topology

-[matrix]

Include matrix (Amino-acid) = JTT LG DCMut MtREV MtMam MtArt Dayhoff WAG

RtREV CpREV Blosum62 VT HIVb HIVw FLU

If you don’t specify any matrix, all matrices displayed above will

be included.

-I

include models with a proportion of invariable sites

-G

include models with rate variation among sites and number of categories

-IG

include models with both +I and +G properties

-all-distributions

include models with rate variation among sites, number of categories and both

-ncat number_of_categories

define number of categories for +G and +I+G models [default: 4]

-F

include models with empirical frequency estimation

-AIC

display models sorted by Akaike Information Criterion (AIC)

-BIC

display models sorted by Bayesian Information Criterion (BIC)

-AICC

display models sorted by Corrected Akaike Information Criterion (AICc)

-DT

display models sorted by Decision Theory Criterion

-sample sample_size_mode

sample size for AICc and BIC corrections [default: 2]

0: Shannon-entropy Sum

1: Average (0-1)Shannon-entropy x NxL

2: Total number of characters (aligment length)

3: Number of variable characters

4: Alignment length x num taxa (NxL)

5: Specified by the user

-size user_size

specified sample size, only for “-sample 5″

-t1

display best-model’s newick tree [default: false]

-t2

display best-model’s ASCII tree [default: false]

-tc consensus_threshold

display consensus tree with the specified threshold

-all

Displays a 7-framework comparison table

-verbose

verbose mode [default: false]

-threads number_or_threads

number of threads to compute (only if MPJ is not used) [default: 1]

The command need to be called in the directory of the folder that contains the program, but input files and output files don’t have to be in the same directory, their paths can be specified in the command line.

Several things to mention about the input files: 1. As far as I know the nexus and phylip formats can be read in without a problem. However it truncates the species names to a maximum of 10 characters, so pay attention if you have long species names. 2. ProtTest accepts simple newick format of trees (didn’t accept nexus file). Make sure the species names match those in the sequence file, especially if they are truncated, then the names in the tree file need to be truncated, too.

Following are 2 examples, they are pretty self-explaining.

java -jar prottest-3.2.jar -i input.file -t tree.file -o output.file -S 3 -all-distributions -F -AIC -BIC -DT -t1 -all -threads 4 > logfile & (run in background)

java -jar prottest-3.2.jar -i input.file -t tree.file -o output.file -S 3 -model JTT -F -AIC -BIC -DT -t1 -all -threads 4 > logfile &

One can use all matrices (models) or specify one of them (with -model option), but I’m not sure how to select a subset of the matrices. But this can be done very easily with the graphical interface.

*Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52: 696-704. Phyml*