Computational Biosciences using HPC systems

Module 3: Phylogenomics

Tutorial

1. Introduction

Login

ssh  -i /path/to/key user@cirrus8.a.incd.pt
sftp -i /path/to/key user@cirrus8.a.incd.pt

Software:

File formats:

Datasets

The dataset used in this exercise is a multiple sequence alignment in FASTA format. It is a subset of the original turtle dataset used to assess the phylogenetic position of turtles relative to crocodiles and birds (Chiari et al., 2012, BMC Biology 10:65; https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-65). The working dataset includes 29 genes, selected from the original set of 248 genes.

All the datasets are here: /users5/tutorial/modulo3/data

# Commands:
cp -r /users5/tutorial/modulo3/data .
ls data
less data/turtle.fa
less data/turtle_mb.nex
less data/turtle.nex
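Before running any analysis it can help to check the alignment quickly from the command line. The sketch below is only an illustration: the helper names count_fasta_seqs and fasta_seq_lengths are hypothetical (they are not part of the tutorial files), and it assumes a standard FASTA file in which every record starts with a ">" header line.

```shell
# Count sequences: every FASTA record starts with a '>' header line.
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# Print the distinct sequence lengths found in the file.
# In a proper alignment all sequences have the same length,
# so exactly one value should be printed.
fasta_seq_lengths() {
    awk '/^>/ { if (name != "") print len; name = $0; len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (name != "") print len }' "$1" | sort -nu
}

# Usage on the tutorial data (after copying it as shown above):
# count_fasta_seqs data/turtle.fa
# fasta_seq_lengths data/turtle.fa
```

If fasta_seq_lengths prints more than one value, the file is probably not aligned (or contains a truncated record).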

2. IQTREE

Where to get help:

http://www.iqtree.org
http://www.iqtree.org/doc/

2.1. Run IQTREE to estimate the best evolutionary model and how many threads (CPU cores) to use in the analysis of this dataset.

mkdir iqtree
cp /users5/tutorial/modulo3/run_iqtree.sh .
less run_iqtree.sh

The submission script should look like this:

#!/bin/bash
#SBATCH -p hpc
#SBATCH --nodes=1   
#SBATCH --tasks-per-node=4  
#SBATCH --cpus-per-task=4    

# Path Variables
DATA_FOLDER='./data'
OUTPUT_FOLDER='./iqtree/model_threads'
FILE='turtle.fa'
OUTGROUP='protopterus'

# Run code
module load iqtree2/2.1.2
iqtree2 -s $DATA_FOLDER/$FILE -st DNA -o $OUTGROUP -m TEST -nt AUTO -ntmax 6 -pre $OUTPUT_FOLDER
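The #SBATCH lines indicate that this is a Slurm batch script, so it would normally be submitted with sbatch rather than run directly. A minimal sketch, assuming the standard Slurm client tools (sbatch, squeue) are available on the login node; the helper name submit_and_watch is hypothetical:

```shell
# Submit a batch script to Slurm and show its entry in the queue.
submit_and_watch() {
    script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: run this on the cluster login node" >&2
        return 1
    fi
    # --parsable makes sbatch print only the job ID (plus ';cluster' if set)
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}
    echo "Submitted job $jobid"
    squeue -j "$jobid"
}

# Usage:
# submit_and_watch run_iqtree.sh
```

By default Slurm writes the job's screen output to a file named slurm-<jobid>.out in the directory from which the job was submitted.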

You can use the IQ-TREE help menu to understand the options used in the submission script and to see which other options are available.

module load iqtree2/2.1.2
iqtree2 -h

Parameters used in this analysis:
-s -> Input alignment in PHYLIP/FASTA/NEXUS/CLUSTAL/MSF format
-st -> BIN, DNA, AA, NT2AA, CODON, MORPH (default: auto-detect)
-o -> Outgroup taxon name for writing .treefile
-m -> Model name; TEST runs standard model selection followed by tree inference
-pre -> Prefix for all output files (default: aln/partition)
-nt -> Number of cores/threads or AUTO for automatic detection
-ntmax -> Max number of threads by -nt AUTO (default: #CPU cores)

Once your analysis is done, inspect the following output file:

less iqtree/model_threads.log
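The log can be long, so it may be easier to pull out only the lines of interest. The sketch below assumes the log wording used by IQ-TREE 2, where the thread benchmark lines contain "Threads:", the chosen thread count is reported as "BEST NUMBER OF THREADS", and the selected model as "Best-fit model". If your log is worded differently, adjust the patterns. The helper name summarize_iqtree_log is hypothetical.

```shell
# Print the thread benchmark, the chosen thread count and the
# best-fit model from an IQ-TREE log file.
summarize_iqtree_log() {
    grep -E 'Threads:|BEST NUMBER OF THREADS|Best-fit model' "$1"
}

# Usage:
# summarize_iqtree_log iqtree/model_threads.log
```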

Note how the run time changes with the number of CPU cores used.
Note the following definitions:
Speedup: the ratio of the serial run time of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors, i.e., "how much faster compared to 1 thread".

Efficiency: the ratio of speedup to the number of processors. Efficiency measures the fraction of time for which a processor is usefully utilised.
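As a worked illustration of these two definitions, writing T(1) for the run time on one thread and T(p) for the run time on p threads, speedup is S(p) = T(1) / T(p) and efficiency is E(p) = S(p) / p. The sketch below computes both with awk. The helper name speedup_table is hypothetical and the timings are made-up values, not measurements from the turtle dataset.

```shell
# speedup_table T1 "p:Tp" ...
# Prints threads, speedup and efficiency for each "threads:seconds" pair,
# where T1 is the run time in seconds on a single thread.
speedup_table() {
    t1="$1"; shift
    for pair in "$@"; do
        p=${pair%%:*}     # number of threads
        tp=${pair##*:}    # run time on p threads
        awk -v t1="$t1" -v p="$p" -v tp="$tp" 'BEGIN {
            s = t1 / tp
            printf "threads=%d speedup=%.2f efficiency=%.0f%%\n", p, s, 100 * s / p
        }'
    done
}

# Hypothetical timings in seconds (not measured on the tutorial data)
speedup_table 60 1:60 2:40 4:25
```

With these example numbers, going from 2 to 4 threads still reduces the run time, but efficiency drops from 75% to 60%, which is the kind of trade-off to look for in the log.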

Questions:

2.2. Run a maximum likelihood analysis with the previously selected model and assess branch support with ultrafast bootstrap using 1000 replicates. Use the previously selected best number of threads.

Note: with the ultrafast bootstrap (UFBoot) we use the option -bnni to reduce the risk of overestimating branch support due to severe model violations. With this option, UFBoot further optimizes each bootstrap tree using a hill-climbing nearest neighbour interchange (NNI) search based directly on the corresponding bootstrap alignment.
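One possible way to assemble the command for this step is sketched below. It reuses the path variables from run_iqtree.sh and the -bb option used elsewhere in this tutorial for 1000 ultrafast bootstrap replicates. The MODEL and THREADS values are placeholders chosen for illustration only; replace them with the model and thread count reported in iqtree/model_threads.log. The block only prints the command so that it can be checked before it is added to a submission script.

```shell
# Path variables as in run_iqtree.sh
DATA_FOLDER='./data'
FILE='turtle.fa'
OUTGROUP='protopterus'
OUTPUT_FOLDER='./iqtree/MLBoot'

# Placeholders (assumptions): use the values from iqtree/model_threads.log
MODEL='GTR+F+I+G4'
THREADS=4

# -bb 1000 : 1000 ultrafast bootstrap replicates
# -bnni    : optimize each bootstrap tree with an additional NNI search
CMD="iqtree2 -s $DATA_FOLDER/$FILE -st DNA -o $OUTGROUP -m $MODEL -nt $THREADS -bb 1000 -bnni -pre $OUTPUT_FOLDER"
echo "$CMD"
```

In the submission script, place the iqtree2 command itself (not the echo) after the line "module load iqtree2/2.1.2", and make sure the number of CPUs requested from Slurm is at least the number of threads.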

Analyze the output files:

MLBoot.log: contains the screen output information.
MLBoot.iqtree: contains the IQ-TREE report.
MLBoot.contree: contains the consensus tree with assigned branch supports where branch lengths are optimized on the original alignment.

less iqtree/MLBoot.log
less iqtree/MLBoot.iqtree
less iqtree/MLBoot.contree
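For a quick overview of the support values without opening a tree viewer, the node labels can be extracted from the Newick string. This is a rough sketch: it assumes that support values are written directly after each closing parenthesis (as in ")100:0.05"), and the helper name count_supported_nodes is hypothetical. The default threshold of 95 follows the IQ-TREE documentation, which suggests treating UFBoot support of 95% or more as reliable.

```shell
# count_supported_nodes TREEFILE [THRESHOLD]
# Counts internal nodes whose support value is >= THRESHOLD (default 95).
count_supported_nodes() {
    file="$1"
    threshold="${2:-95}"
    # Support values are the numbers that directly follow a ')'
    grep -oE '\)[0-9]+(\.[0-9]+)?' "$file" | tr -d ')' |
        awk -v t="$threshold" '
            { n++; if ($1 >= t) k++ }
            END { printf "%d of %d internal nodes have support >= %s\n", k, n, t }'
}

# Usage:
# count_supported_nodes iqtree/MLBoot.contree
# count_supported_nodes iqtree/MLBoot.contree 80
```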

Questions:

Four ways to visualize and manipulate the tree that you just inferred:

  1. Run the program Figtree on the server:
    Note that this only works if you enabled X11 forwarding at login with "-X" (ssh -X -i /path/to/key username@cirrus8.a.incd.pt)

    module load figtree/1.4.3
    figtree
  2. Download Figtree to your computer (https://github.com/rambaut/figtree/) and use sftp to download the tree file from the server to your computer.

    #1. open a terminal window
    #2. cd to the folder where you want to keep your files
    #3. type:
    sftp -i ~/.ssh/FCT_rsa username@cirrus8.a.incd.pt
    #4. on the server, cd to your iqtree folder
    #5. type:
    get MLBoot.treefile
    #6. On your computer open this file with Figtree.
  3. Visualize the tree using the web server of ETE Toolkit (http://etetoolkit.org/treeview/)

  4. Visualize the tree using the web server of ITOL (https://itol.embl.de) ** recommended **

2.3. Run a maximum-likelihood analysis with a partition model, but use IQTREE ModelFinder to choose the right partitioning scheme.

ModelFinder implements a greedy strategy (Lanfear et al., 2012, https://academic.oup.com/mbe/article/29/6/1695/1000514) that starts with the full partition model and successively merges pairs of genes until the model fit no longer improves. Once ModelFinder has found the best partitioning scheme, IQ-TREE immediately starts the tree reconstruction under the best-fit partition model.
To run this analysis, use the following lines of code (note that we removed the parameter that indicates the outgroup):

PARTITION='turtle.nex'
OUTPUT_FOLDER='partition-merge'
iqtree2 -s $DATA_FOLDER/$FILE -p $DATA_FOLDER/$PARTITION -st DNA -m MFP+MERGE -nt 6 -bb 1000 -pre $OUTPUT_FOLDER

Analyze the output files:

partition-merge.log: contains the screen output information.
partition-merge.iqtree: contains the IQ-TREE report. Check here which partitioning scheme was optimal and which model was selected for each partition.
partition-merge.treefile: contains the maximum-likelihood tree.
partition-merge.best_scheme.nex: contains the best partitioning scheme
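To see how many partitions were merged, one option is to compare the number of charset definitions in the input partition file with the number in the best scheme. The sketch below assumes that both files are NEXUS sets blocks with one "charset" statement per line; the helper name count_charsets is hypothetical.

```shell
# Count the 'charset' definitions in a NEXUS partition file.
# Lines starting with 'charpartition' are deliberately not counted.
count_charsets() {
    grep -ciE '^[[:space:]]*charset[[:space:]]' "$1"
}

# Usage: compare the starting scheme with the merged scheme
# count_charsets data/turtle.nex
# count_charsets partition-merge.best_scheme.nex
```

A smaller number for the best scheme means that ModelFinder merged some of the original gene partitions.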

3. Phylogenetic inference of HFV/Ebola virus amino acid sequences

We will use the freely available datasets from the HFV (hemorrhagic fever virus) and Ebola database project (https://hfv.lanl.gov/content/index).

Reference: Kuiken C, Thurmond J, Dimitrijevic M, Yoon H. The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses. Nucleic Acids Res. 2012 Jan;40:D587-92.

Exercise 3.1: Phylogenetic inference of HFV virus using nucleotide sequences

Q: How many sequences are in the dataset?
Q: What was the best model selected by IQ-TREE to run the maximum-likelihood analysis?

Q: What can you learn from this phylogenetic analysis?

Exercise 3.2: Phylogenetic inference of HFV/Ebola virus using amino acid sequences

Q: What was the best model selected by IQ-TREE to run the maximum-likelihood analysis?
Q: What can you learn from this phylogenetic analysis?
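As a starting point for Exercises 3.1 and 3.2, the sketch below assembles an IQ-TREE command that differs only in the sequence type (-st DNA for nucleotides, -st AA for amino acids). It reuses options already seen in this tutorial; -m MFP (ModelFinder followed by tree inference) is the non-merging counterpart of the MFP+MERGE option used in section 2.3. The helper name build_iqtree_cmd and the file names are hypothetical; replace them with the alignments provided for the exercise.

```shell
# build_iqtree_cmd ALIGNMENT SEQTYPE PREFIX
# Prints an iqtree2 command line; SEQTYPE is DNA or AA.
build_iqtree_cmd() {
    printf 'iqtree2 -s %s -st %s -m MFP -nt AUTO -ntmax 6 -bb 1000 -bnni -pre %s\n' \
        "$1" "$2" "$3"
}

# Hypothetical file names (assumptions, not part of the tutorial data)
build_iqtree_cmd hfv_nt.fa DNA hfv_nt
build_iqtree_cmd hfv_aa.fa AA hfv_aa
```

The printed commands can be pasted into a copy of run_iqtree.sh in place of the original iqtree2 line.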

Exercise 3.3: Phylogenetic inference of HFV virus ... yes, again... if you still have time...

This is the end!