Compile R packages under MacOSX 10.7 (lion)

When I tried to build an R package from source file using the following command:

R CMD INSTALL packagename

I encountered this error: “ld: library not found for -lgfortran”, this is because I don’t have fortran compiler installed. Getting the latest gfortran compiler didn’t solve the problem for me, but an easy fix is:

Go to http://cran.r-project.org/bin/macosx/tools/ and download the first package and install it. Problem solved!

Run ProtTest in command line

ProtTest is a bioinformatic tool for the selection of best-fit models of amino acid replacement for the data at hand. It uses PhyML  (Guindon and Gascuel, 2003) to calculate the maximum likelihood (ML) estimates of phylogenetic trees and model parameters. Models included in ProtTest are empirical models of amino acids substitution. After obtaining the ML estimates and loglikelihood, it compares the models and find the best model according to Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) scores or Decision Theory Criterion (DT). ProtTest also provides other information about parameters and trees. I’ll show how to use ProtTest in command line in the following.

For command line execution of ProtTest, these are the parameters that can be set by user (copied from README in the package):

-i alignment_filename
alignment input file (required)
-t tree_filename
tree file (optional) [default: NJ tree]
-o output_filename
output file (optional) [default: standard output]
-S optimization_strategy
optimization strategy mode: [default: 0]
0: Fixed BIONJ JTT
1: BIONJ Tree
2: Maximum Likelihood tree
3: User defined topology
-[matrix]
Include matrix (Amino-acid) = JTT LG DCMut MtREV MtMam MtArt Dayhoff WAG
RtREV CpREV Blosum62 VT HIVb HIVw FLU
If you don’t specify any matrix, all matrices displayed above will
be included.

-I
include models with a proportion of invariable sites
-G
include models with rate variation among sites and number of categories
-IG
include models with both +I and +G properties
-all-distributions
include models with rate variation among sites, number of categories and both
-ncat number_of_categories
define number of categories for +G and +I+G models [default: 4]
-F
include models with empirical frequency estimation
-AIC
display models sorted by Akaike Information Criterion (AIC)
-BIC
display models sorted by Bayesian Information Criterion (BIC)
-AICC
display models sorted by Corrected Akaike Information Criterion (AICc)
-DT
display models sorted by Decision Theory Criterion
-sample sample_size_mode
sample size for AICc and BIC corrections [default: 2]
0: Shannon-entropy Sum
1: Average (0-1)Shannon-entropy x NxL
2: Total number of characters (aligment length)
3: Number of variable characters
4: Alignment length x num taxa (NxL)
5: Specified by the user
-size user_size
specified sample size, only for “-sample 5″
-t1
display best-model’s newick tree [default: false]
-t2
display best-model’s ASCII tree [default: false]
-tc consensus_threshold
display consensus tree with the specified threshold
-all
Displays a 7-framework comparison table
-verbose
verbose mode [default: false]
-threads number_or_threads
number of threads to compute (only if MPJ is not used) [default: 1]

The command need to be called in the directory of the folder that contains the program, but input files and output files don’t have to be in the same directory, their paths can be specified in the command line.

Several things to mention about the input files: 1. As far as I know the nexus and phylip formats can be read in without a problem. However it truncates the species names to a maximum of 10 characters, so pay attention if you have long species names. 2. ProtTest accepts simple newick format of trees (didn’t accept nexus file). Make sure the species names match those in the sequence file, especially if they are truncated, then the names in the tree file need to be truncated, too.

Following are 2 examples, they are pretty self-explaining.

java -jar prottest-3.2.jar -i input.file -t tree.file -o output.file -S 3 -all-distributions -F -AIC -BIC -DT -t1 -all -threads 4 > logfile & (run in background)

java -jar prottest-3.2.jar -i input.file -t tree.file -o output.file -S 3 -model JTT -F -AIC -BIC -DT -t1 -all -threads 4 > logfile & 

One can use all matrices (models) or specify one of them (with -model option), but I’m not sure how to select a subset of the matrices. But this can be done very easily with the graphical interface.

  • Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52: 696-704. Phyml

Model II Regression, R package ‘lmodel2′

In simple linear regression, the explanatory variables (x) are assumed to be measured accurately and there is only measurement error in the response variable (y). When both variables in the regression are random, the linear square approach underestimates the slope of the linear relationship. In this case, model II regression should be used instead. The package lmodel2 (CRAN site for this package) in R provides several methods to do this kind of regression.

codeml in PAML package

Goldman and Yang(1994) proposed one of the earliest codon substitution models, which utilizes information on both nucleotide level in the DNA sequences and the amino acid level   of synonymous and non-synonymous codon substitution. Since then a big number of codon models have been developed, see Yang and Bielawski (2000) and Yang (2002). This is a brief description of how to analyze data under the codon substitution models, using codeml in the PAML under the MacOS environment.

First, get the input files ready. The sequence data should be codon sequences (groups of 3 nucleotides), preferably in PHYLIP format. If you want to provide your own tree, or specify different ratios of nonsynonymous/synonymous substitution rates on different branches, a tree file should also be included.

Second, edit the control file. This file specifies all the setting of the model desired, as well as the control of input and output. For example, the file could be named after the data file: example.ctl.

Now everything is ready, to execute the analysis, just type codeml example.ctl in the terminal. If the control file is named as codeml.ctl then the command could simply be codeml.

More details to be added about each step, including how to read the output.