Computational Biosciences using HPC systems

Module 3: Phylogenomics

Tutorial

1. Introduction

ssh  -i /path/to/key user@cirrus8.a.incd.pt
sftp -i /path/to/key user@cirrus8.a.incd.pt

Software:

IQTREE
module load iqtree2/2.1.2
Figtree
module load figtree/1.4.3

File formats:

FASTA format https://en.wikipedia.org/wiki/FASTA_format
NEXUS format https://en.wikipedia.org/wiki/Nexus_file
PHYLIP format https://en.wikipedia.org/wiki/PHYLIP

Datasets

Turtle.fa
Turtle.nex

The dataset used in this exercise is a multiple sequence alignment in FASTA format. It is a subset of the original Turtle dataset used to assess the phylogenetic position of Turtles relative to Crocodiles and Birds (Chiari et al., 2012 MBC Biology 10; https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-65). The working dataset includes 29 genes, selected from the original set of 248 genes.

All the datasets are here: /users5/tutorial/modulo3/data

Copy the "data" folder to your home directory and check what is inside.
Use "ls" to see the list of files inside the folder "data".
Use "less" to read the files. Remember that you need to type "q" to exit the "less" window.

# Commands:
cp -r /users5/tutorial/modulo3/data .
ls data
less data/turtle.fa
less data/turtle_mb.nex
less data/turtle.nex

2. IQTREE

Where to get help:

http://www.iqtree.org
http://www.iqtree.org/doc/

2.1. Run IQTREE to estimate the best evolutionary model and how many threads (CPU cores) to use in the analysis of this dataset.

Create a new folder called "iqtree" to perform all analyses with this tool.
Copy the submission script "run_iqtree.sh" from '/users5/tutorial/modulo3/run_iqtree.sh' to your home directory.
Read the submission script to make sure you understand everything that is there.

mkdir iqtree
cp /users5/tutorial/modulo3/run_iqtree.sh .
less run_iqtree.sh

The submission script should look like this:

#!/bin/bash
#SBATCH -p hpc
#SBATCH --nodes=1   
#SBATCH --tasks-per-node=4  
#SBATCH --cpus-per-task=4    

# Path Variables
DATA_FOLDER='./data'
OUTPUT_FOLDER='./iqtree/model_threads
FILE='turtle.fa'
OUTGROUP='protopterus'

# Run code
module load iqtree2/2.1.2
iqtree2 -s $DATA_FOLDER/$FILE -st DNA -o $OUTGROUP -m TEST -nt AUTO -ntmax 6 -pre $OUTPUT_FOLDER

You can use the iqtree help menu to understand the previous command and see what other commands are available.

module load iqtree2/2.1.2
iqtree -h

Launch the analysis using --> sbatch ./run_iqtree.sh
Check if your analysis is in the queue --> squeue

Parameters used in this analysis:
-s -> Input alignment in PHYLIP/FASTA/NEXUS/CLUSTAL/MSF format
-st -> BIN, DNA, AA, NT2AA, CODON, MORPH (default: auto-detect)
-o -> Outgroup taxon name for writing .treefile
-pre -> Prefix for all output files (default: aln/partition)
-nt -> Number of cores/threads or AUTO for automatic detection
-ntmax -> Max number of threads by -nt AUTO (default: #CPU cores)

Once your analysis is done analyse the following output file:

less iqtree/model_threads.log

See how time changes with the number of CPU cores used.
Note the following definitions:
Speedup: The speedup is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors. I.e, "how much faster compared to 1 thread"

Efficiency: is defined as the ratio of speedup to the number of processors. Efficiency measures the fraction of time for which a processor is usefully utilised.

Questions:

Q: What was the best model?
Q: What is the best number of nodes/threads to use in the analysis of this dataset?

2.2. Run a maximum likelihood analysis with the previously selected model and assess branch support with ultrafast bootstrap using 1000 replicates. Use the previously selected best number of threads.

Using your favorite text editor (nano, vim or other) edit the previous submission script with the new code:
```
iqtree2 -s $DATA_FOLDER/$FILE -st DNA -o $OUTGROUP -m GTR+F+I+G4  -nt 6 -bb 1000 -bnni -pre OUTPUT_FOLDER
```
Note that you should also change the OUTPUT_FOLDER name to something like: OUTPUT_FOLDER='./iqtree/MLBoot'

Note: while using the ultrafast bootstrap analysis we use the option -bnni to reduce the risk of overestimating branch support with UFBoot due to severe model violations. With this option UFBoot will further optimize each bootstrap tree using a hill-climbing nearest neighbour interchange (NNI) search based directly on the corresponding bootstrap alignment.

Analyze the output files:

MLBoot.log: contains the screen output information.
MLBoot.iqtree: contains the IQ-TREE report.
MLBoot.contree: contains the consensus tree with assigned branch supports where branch lengths are optimized on the original alignment.

less iqtree/MLBoot.log
less iqtree/MLBoot.iqtree
less iqtree/MLBoot.contree

Questions:

Q: When reading the dataset the software performs a chi2 test. Did all sequences pass the test? What exactly is being tested here?

Four ways to visualize and manipulate the tree that you just inferred

Call the program Figtree on the server:
Note that this will only work if at login you specified "-X" (ssh -X -i /path/to/key username@cirrus8.a.incd.pt)
```
module load figtree/1.4.3
figtree
```

Download Figtree to your computer (https://github.com/rambaut/figtree/) and use sftp to download file from server to your computer.

#1. open a terminal window
#2. cd to the folder where you want to keep your files
#3. type:
sftp -i ~/.ssh/FCT_rsa username@cirrus8.a.incd.pt
#4. in the server cd to your iqtree folder
#5. type:
get MLBoot.treefile
#6. On your computer open this file with Figtree.

Visualize the tree using the web server of ETE Toolkit (http://etetoolkit.org/treeview/)
Visualize the tree using the web server of ITOL (https://itol.embl.de) ** recommended **

2.3. Run a maximum-likelihood analysis with a partition model, but use IQTREE ModelFinder to choose the right partitioning scheme.

ModelFinder implements a greedy strategy (Lanfear et al., 2012, https://academic.oup.com/mbe/article/29/6/1695/1000514) that starts with the full partition model and subsequently merges two genes until the model fit does not increase any further. After ModelFinder found the best partition, IQ-TREE will immediately start the tree reconstruction under the best-fit partition model.
To run this analysis use the following lines of code (note that we removed the parameter that indicates the outgroup):

PARTITION='turtle.nex'
OUTPUT_FOLDER='partition-merge'
iqtree2 -s $DATA_FOLDER/$FILE -p $DATA_FOLDER/$PARTITION -st DNA -m MFP+MERGE  -nt 6 -bb 1000 -pre $OUTPUT_FOLDER

Analyze the output files:

partition-merge.log: contains the screen output information.
partition-merge.iqtree: contains the IQ-TREE report. See here what was the optimal partition and their models.
partition-merge.treefile: contains the maximum-likelihood tree.
partition-merge.best_scheme.nex: contains the best partitioning scheme

3. Phylogenetic inference of HFV/Ebola virus amino acid sequences

We will use the freely available datasets from the HFV (hemorragic fever virus) and Ebola database project. https://hfv.lanl.gov/content/index Kuiken C, Thurmond J, Dimitrijevic M, Yoon H. The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses. Nucleic Acids Res. 2012 Jan;40:D587-92

Exercise 3.1: Phylogenetic inference of HFV virus using nucleotide sequences

Go to the HFV/Ebola database, HFV Resources, and extract a sequence alignment in fasta format with the following characteristics: Genus: Ebolavirus-Marburgvirus-Cuevavirus
alignment type: one sequence per outbreak
Region: spike glycoprotein
DNA sequences; 2015-Jan

Save your new dataset in your data folder.
Create a new folder called "iqtree_HFV" to keep the new output results from iqtree.
Infer the maximum-likelihood tree for this dataset using Iqtree and include an analysis of branch support using 1000 replicates of ultrafast bootstrap.

Q: how many sequences are in the dataset?
Q: What was the best model selected by Iqtree to run the maximum-likelihood analysis?

Choose one of the above methods to visualize the tree and study the phylogentic relationships using a mid-point rooting or position the root on the largest branch.
You can use the information on this page to interpret the sequence names: https://hfv.lanl.gov/content/sequence/HFV/ebola_names.html

Q: What can you learn from this phylogenetic analysis?

Exercise 3.2: Phylogenetic inference of HFV/Ebola virus using amino acid sequences

Go back to the HFV/Ebola database website and select another dataset now with protein sequences. Select the dataset you want to analyze making sure that it is not too large (<100 sequences), and it contains proteins sequences.

Q: What was the best model selected by Iqtree to run the maximum-likelihood analysis?
Q: What can you learn from this phylogenetic analysis?

Exercise 3.3: Phylogenetic inference of HFV virus ... yes, again... if you still have time...

Go to the HFV/Ebola database, HFV Resources, and extract other sequence alignments in fasta format.
Check if you get a different phylogenetic tree from exercise 3.1. if you now use the whole genome.
Check if you get a different phylogenetic tree from exercise 3.1. if you now use the protein sequence of the spike gene. Interpret the results.

Computational Biosciences using HPC systems

Module 3: Phylogenomics

Tutorial

1. Introduction

Login

Software:

File formats:

Datasets

2. IQTREE

Where to get help:

2.1. Run IQTREE to estimate the best evolutionary model and how many threads (CPU cores) to use in the analysis of this dataset.

2.2. Run a maximum likelihood analysis with the previously selected model and assess branch support with ultrafast bootstrap using 1000 replicates. Use the previously selected best number of threads.

Analyze the output files:

Four ways to visualize and manipulate the tree that you just inferred

2.3. Run a maximum-likelihood analysis with a partition model, but use IQTREE ModelFinder to choose the right partitioning scheme.

Analyze the output files:

3. Phylogenetic inference of HFV/Ebola virus amino acid sequences

This is the end!