phylogeny_protocol
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| phylogeny_protocol [2021/09/03 14:01] – created 134.190.232.139 | phylogeny_protocol [2021/09/29 12:53] (current) – 134.190.232.139 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | **1. Indexing sequences** | ||
| - | Indexing sequences, creating alignments, building trees and dN/dS analysis user guide | + | Suppose you have a list of gene names or IDs of interest and want to identify their homologs and paralogs. The easiest way is to BLAST them against NCBI, but first you need the protein or nucleotide sequences of those genes. An NCBI name search works for a handful of genes, but what if your gene IDs are not from NCBI and you have thousands of them, e.g., At2G01130, AT4G00010, etc., which come from the TAIR10 database (A. thaliana)? You are going to need this resource |
| - | Note: Please | + | Software resource: |
| + | - Batch entrez | ||
| + | - index_header_to_seq | ||
| + | |||
| + | __Batch entrez__ is a web tool (https:// | ||
| + | - Prepare an Entrez ID file and upload it | ||
| + | - Select the ID database | ||
| + | - Retrieve and save the results in FASTA format | ||
| + | |||
| + | Note: You need to prepare a file listing Entrez accession numbers or other identifiers recognized by NCBI, e.g., QKK48680.1, which is a GenBank ID in the protein database. You may get the following error if you select the wrong database or supply an ID that NCBI does not recognize. | ||
| + | |||
| + | < | ||
| + | An illegal character in a token. Possible wrong file format. Request processing canceled. | ||
| + | </ | ||
| + | |||
| + | |||
| + | As mentioned at the beginning, if your gene names or IDs of interest are not from NCBI and you need their FASTA sequences, there is a simple way to get them with a custom script | ||
| + | __index_header_to_seq.py__ (https:// | ||
| + | |||
| + | < | ||
| + | |||
| + | python3 index_header_to_seq.py database_sequence.fasta name_list.txt out_name_listed_seq.fa | ||
| + | |||
| + | # ' | ||
| + | # ' | ||
| + | # ' | ||
| + | |||
| + | </ | ||
| + | |||
| + | |||
| + | Now feel free to explore your genes of interest using the BLAST+ and v5 database user guide (please | ||
| + | |||
| + | |||
| + | **2. Creating and trimming alignments, building | ||
| + | |||
| + | Software resource: | ||
| + | |||
| + | - Clustal Omega 1.2.3 | ||
| + | - trimAl v1.2 | ||
| + | - FastTree 2.1 | ||
| + | |||
| + | |||
| + | Clustal Omega 1.2.3 (http:// | ||
| + | |||
| + | < | ||
| + | # On Ubuntu, install with: | ||
| + | sudo apt install clustalo | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | |||
| + | ./ | ||
| + | </ | ||
| + | |||
| + | Note: //For protein alignments we recommend Clustal Omega. For DNA alignments we recommend trying MUSCLE or MAFFT.// https:// | ||
| + | |||
| + | trimAl v1.2 (http:// | ||
| + | (http:// | ||
| + | |||
| + | A very common way of using trimAl v1.2 to trim an alignment is to use just a gap threshold | ||
| + | (the minimum fraction of sequences without a gap that you require to consider a column of “enough quality”). Note: | ||
| + | < | ||
| + | trimal -in example1 -out output1 -htmlout output1.html -gt 1 | ||
| + | </ | ||
| + | |||
| + | Sometimes one does not know which alignment algorithm will perform best (or which parameters, e.g., gap penalties). A way out is to produce several alignments with different algorithms and then choose the alignment that contains the most consistent residue-pairings, | ||
| + | < | ||
| + | trimal -compareset fileset1 -out output4 | ||
| + | trimal -compareset fileset1 -out output5 -htmlout output5.html -ct 0.5 | ||
| + | </ | ||
| + | |||
| + | FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. http:// | ||
| + | |||
| + | < | ||
| + | FastTree < alignment_file > tree_file | ||
| + | </ | ||
| + | |||
| + | **3. dN/dS analysis** | ||
| + | |||
| + | Software requirements: | ||
| + | - PAML package | ||
| + | - pal2nal | ||
| + | - Clustal Omega | ||
| + | - FastTree | ||
| + | |||
| + | The calculation of synonymous (dS) and non-synonymous (dN) substitution rates is important to infer the evolutionary driving force: positive selection (dN/ | ||
| + | |||
| + | |||
| + | PAML is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood. http:// | ||
| + | |||
| + | PAL2NAL is a program that converts a multiple sequence alignment of proteins and the corresponding DNA (or mRNA) sequences into a codon alignment. http:// | ||
| + | |||
| + | This is an example batch script for computing dN/dS across thousands of genes. | ||
| + | |||
| + | < | ||
| + | # | ||
| + | for i in *.txt | ||
| + | do | ||
| + | perl pal2nal.pl amino_acid.fa nucleotide.fa -out paml.file -nogap > folder/$i | ||
| + | done | ||
| + | </ | ||
| + | |||
| + | Shell script: codeml and configuration file: codeml.ctl | ||
| + | |||
| + | Note: Please refer to the latest version of software | ||
| - | {{: | ||
| <Last updated by Xi Zhang on Sep 3rd, | <Last updated by Xi Zhang on Sep 3rd, | ||
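
The ID-based sequence lookup described in section 1 (Indexing sequences) can be sketched in Python as follows. This is a minimal sketch, not the actual index_header_to_seq.py script, whose source is not shown on this page. It assumes that the gene ID is the first whitespace-delimited token of each FASTA header and that the name list holds one ID per line. The function names `read_fasta` and `extract_by_ids` are hypothetical, and the argument order simply mirrors the command shown in section 1.

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_by_ids(fasta_handle, wanted_ids):
    """Return {id: (header, sequence)} for records whose ID is in wanted_ids.

    Assumption: the record ID is the first whitespace-delimited
    token of the FASTA header. Matching is exact and case-sensitive.
    """
    wanted = set(wanted_ids)
    found = {}
    for header, seq in read_fasta(fasta_handle):
        tokens = header.split()
        rec_id = tokens[0] if tokens else ""
        if rec_id in wanted and rec_id not in found:
            found[rec_id] = (header, seq)
    return found


def main(argv):
    # Assumed argument meaning: database FASTA, name list, output FASTA.
    db_path, list_path, out_path = argv[1:4]
    with open(list_path) as fh:
        ids = [line.strip() for line in fh if line.strip()]
    with open(db_path) as fh:
        found = extract_by_ids(fh, ids)
    with open(out_path, "w") as out:
        for rec_id in ids:
            if rec_id in found:
                header, seq = found[rec_id]
                out.write(f">{header}\n{seq}\n")
            else:
                print(f"warning: {rec_id} not found", file=sys.stderr)


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

Because matching here is exact and case-sensitive, mixed-case IDs such as At2G01130 would need to be normalized to the spelling used in the database headers before the lookup; whether the original script does this is not stated on this page.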
phylogeny_protocol.1630688519.txt.gz · Last modified: by 134.190.232.139
