multi-gene_phylogeny_pipeline
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| multi-gene_phylogeny_pipeline [2017/06/16 09:54] – cgeb2001 | multi-gene_phylogeny_pipeline [2018/03/10 11:07] (current) – 173.212.69.201 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | **Multi-gene phylogeny tree using Matt Brown’s Bordor dataset and pipeline** | + | ====== |
| Documentation by Kate Glennon, Sarah Shah, Shelby Williams, and Tommy Harding. | Documentation by Kate Glennon, Sarah Shah, Shelby Williams, and Tommy Harding. | ||
| - | The **Bordor** dataset is a set of 351 housekeeping genes that are well-conserved across all eukaryotes. This pipeline uses the gene sequences from // | + | The **Bordor** dataset is a set of 351 housekeeping genes that are well-conserved across all eukaryotes. This pipeline uses the gene sequences from // |
| + | |||
| + | All the original transcriptom/ | ||
| **The Pipeline Overview** | **The Pipeline Overview** | ||
| Line 63: | Line 66: | ||
| (Roger, AJ., lecture slide for BIOC4010) | (Roger, AJ., lecture slide for BIOC4010) | ||
| + | |||
| ---- | ---- | ||
| + | |||
| Software involved in this pipeline: **BLAST** ([[https:// | Software involved in this pipeline: **BLAST** ([[https:// | ||
| Line 73: | Line 78: | ||
| **Code availability: | **Code availability: | ||
| - | As of June 14 2017, you can copy the following folder as your “START”: | + | As of June 14 2017, you can copy the following folder as your “START*”: |
| / | / | ||
| Additional scripts listed in the following instructions can be copied from: | Additional scripts listed in the following instructions can be copied from: | ||
| - | /scratch2/sarahshah/START.15.06.17ss | + | /home/sarahshah/AdditionalScripts/ |
| ---- | ---- | ||
| + | |||
| **Command line usage and scripts for Matt Brown’s multi-gene phylogeny pipeline** | **Command line usage and scripts for Matt Brown’s multi-gene phylogeny pipeline** | ||
| - | This pipeline builds on others’ databases, so if you want the most updated, most diverse set of genes and species, copy the “END” folder from someone you know who just finished running the pipeline. Database directories can change too, so please refer to the last person who used it. For example, when Matt Brown first came up with this pipeline, it linked to the OrthoMCL database on Scinet (University of Toronto), but now it is linked to Yana’s directory. | + | This pipeline builds on others’ databases, so if you want the most updated, most diverse set of genes and species, copy the “END*” folder from someone you know who just finished running the pipeline. Database directories can change too, so please refer to the last person who used it. For example, when Matt Brown first came up with this pipeline, it linked to the OrthoMCL database on Scinet (University of Toronto), but now it is linked to Yana’s directory. |
| The following are the instructions for running the pipeline. They are adapted from Matt Brown’s instructions in AddPipeline3.0.py. The function of each shell script will be explained in the next section. For now, please use AddPipeline3.0a.py as this version has the OrthoMCL database linked to Yana’s directory. | The following are the instructions for running the pipeline. They are adapted from Matt Brown’s instructions in AddPipeline3.0.py. The function of each shell script will be explained in the next section. For now, please use AddPipeline3.0a.py as this version has the OrthoMCL database linked to Yana’s directory. | ||
| - | Step 1: Copy an “END” folder into your directory. Rename it as a “START” folder with the current date. Copy Bordor.sh, MakeShell.py, | + | Step 1: Copy an “END*” folder into your directory. Rename it as a “START*” folder with the current date. Copy Bordor.sh, MakeShell.py, |
| - | Step 2: Copy your transcriptome fasta file or your translated protein fasta file (such as through using TransDecoder) to your START folder. Rename this file according to the 8-letter-format plus the .fas extension as described in the AddPipeline.py script. For example, | + | Step 2: Copy your transcriptome fasta file or your translated protein fasta file (such as through using TransDecoder) to your “START*” folder. Rename this file according to the 8-letter-format plus the .fas extension as described in the AddPipeline.py script |
| - | Step 3: Edit Bordor.sh and change the input file name to yours. The number “1” refers to the standard genetic code. Keep the “NUC” if your file contains nucleotide sequences, or change it to “AA” for protein sequences. Say “yes” for the last flag at the end of the line (if you want your alignments to be trimmed by BMGE). Change the date attached to “END” to match the date of this START folder. Make sure the AddPipline3.X.py in your START folder matches the one in this shell script. As of now, AddPipeline3.0a.py is the latest version. Then qsub Bordor.sh | + | Step 3: If you need to add sequences from only one taxon, do the following, otherwise see NOTE at the end of step 3. Edit Bordor.sh and change the input file name to yours, |
| + | < | ||
| + | # | ||
| + | #$ -S /bin/sh | ||
| + | . / | ||
| + | #$ -o matt_pipeline.info.txt | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 8 | ||
| - | *You can also run the pipeline directly from the command line: | + | python AddPipeline3.0a.py |
| - | + | </ | |
| - | python AddPipeline3.0a.py | + | The number “1” refers to the standard genetic code. Use “NUC” if your fasta file contains nucleotide sequences, or change it to “AA” for protein sequences. Say “yes” for the last flag at the end of the line if you want your alignments to be trimmed by bmge. Edit the date attached to “END*” to match today’s date. Make sure the AddPipeline3.X.py in your “START*” folder matches the one in this shell script. As of now, AddPipeline3.0a.py is the latest version. **Make sure your "short name" and "long name" are correct, i.e. 8 characters for the former, and the latter must have one " |
| - | Step 4: If the script finished, there will be a folder | + | NOTE: If you need to add sequences from several taxa: in step 1, instead of renaming the “END*” |
| + | < | ||
| + | # | ||
| + | #$ -S /bin/sh | ||
| + | . /etc/ | ||
| + | #$ -o matt_pipeline.info.txt | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 8 | ||
| - | Step 4.5: If you have multiple new transcriptomes/protein sequences to add to this pipeline, you can pause here and start from Step 1 again using the END folder produced at this point, and add in your new sequence. The goal of AddPipeline.py is to make .bmge.fas files that can then be made into single gene trees. | + | cp Genuspec1.fas Genuspec1/ |
| + | cd Genuspec1/ | ||
| + | python AddPipeline3.0a.py Genuspec1 Genus_species1 1 AA Bordor.351.refdat.txt ./ Genuspec2 no | ||
| - | Step 5: Edit the MakeShell.py script to create shell scripts for each .bmge.fas file to make a single-gene phylogenetic tree. As of now, FastTree, RAxML, and IQTree options are available. You can change the models, number of CPUs to use, etc. View the “help” of each program (FastTree -help, raxmlHPC -help, iqtree-omp -h) to help you choose. | + | mkdir ../ |
| + | cd ../ | ||
| + | python AddPipeline3.0a.py Genuspec2 Genus_species2 1 AA Bordor.351.refdat.txt ./ Genuspec3 no | ||
| - | Step 6: Run the MakeShell.py by typing in: | + | mkdir ../ |
| + | cd ../ | ||
| + | python AddPipeline3.0a.py Genuspec3 Genus_species3 1 AA Bordor.351.refdat.txt ./ END.YY-MM-DD yes | ||
| + | </ | ||
| + | This will sequentially add the appropriate sequences for all the organisms of interest to the Bordor dataset. Trimming will not occur until the last taxon is added. | ||
| - | python MakeShell.py <iqtree, RAxML, or FastTree> | + | NOTE2: If you have alignment files from someone else, and you want to add your own transcriptomes to them, move the alignment files in the folder " |
| + | Step 4: If everything went as expected, there will be a folder named “bmge_trimmed_old” in the “END*” folder. Download a bunch of *.faa (aligned non-trimmed sequences) and *.bmge.fas (trimmed aligned sequences) files to your computer and examine them with a sequence viewer such as AliView. The last line(s) is the sequence from your transcriptome/ | ||
| + | |||
| + | Step 5: Copy all the bmge.fas files to a new folder (eg. mkdir Single_gene_trees_Genuspec) together with the script MakeShell.py (which creates a shell script for each *.bmge.fas file to build a single-gene phylogenetic tree). As of now, FastTree, RAxML, and IQTree options are available. You can edit MakeShell.py according to your needs by changing the models, number of CPUs to use, etc. View the “help” of each program (FastTree -help, raxmlHPC -help, iqtree-omp -h) to help you choose. | ||
| + | |||
| + | Step 6: Run the MakeShell.py by typing in: | ||
| + | < | ||
| + | python MakeShell.py <iqtree, RAxML, or FastTree> | ||
| + | </ | ||
| You should immediately see new .sh files made for each gene. | You should immediately see new .sh files made for each gene. | ||
| Step 7: This step until Step 9 are for submitting all those tree shell scripts in an efficient way. Make a simple file containing a list of all the .sh just created by typing: | Step 7: This step until Step 9 are for submitting all those tree shell scripts in an efficient way. Make a simple file containing a list of all the .sh just created by typing: | ||
| + | < | ||
| ls iq*sh > list_scripts_iqtree (or Fast*sh or RAxML*sh to the name of another tree program) | ls iq*sh > list_scripts_iqtree (or Fast*sh or RAxML*sh to the name of another tree program) | ||
| - | + | </ | |
| Step 8: Create a text file (called “max_cpu.txt”) that contain the total number CPUs you would like to use when building your phylogenies and another that specifies the number of seconds to wait between submitting each shell script called “sleep_time.txt”. You can do this by: | Step 8: Create a text file (called “max_cpu.txt”) that contain the total number CPUs you would like to use when building your phylogenies and another that specifies the number of seconds to wait between submitting each shell script called “sleep_time.txt”. You can do this by: | ||
| - | * echo “40” > max_cpu.txt | + | < |
| - | | + | echo “40” > max_cpu.txt |
| - | + | echo “2” > sleep_time.txt | |
| + | </ | ||
| The above files can be edited in real-time; increase the CPUs and sleep time if jobs are slow in being submitted. A good option for FastTree is 40 CPUs and 2 seconds of sleep time. For IQTree, 40 CPUs and 60 seconds of sleep time seem to work. You do not need to cancel your jobs after editing these files. | The above files can be edited in real-time; increase the CPUs and sleep time if jobs are slow in being submitted. A good option for FastTree is 40 CPUs and 2 seconds of sleep time. For IQTree, 40 CPUs and 60 seconds of sleep time seem to work. You do not need to cancel your jobs after editing these files. | ||
| Line 132: | Line 169: | ||
| ii) the data related to these taxa were generated by someone else and are not published yet, iii) taxa irrelevant to your study. You might want to keep a few organisms as outgroups. | ii) the data related to these taxa were generated by someone else and are not published yet, iii) taxa irrelevant to your study. You might want to keep a few organisms as outgroups. | ||
| For example, to keep all the taxa that are publicly available, do: | For example, to keep all the taxa that are publicly available, do: | ||
| - | * grep “YES” Bordor.Taxa.txt > YESlist | + | < |
| - | | + | grep “YES” Bordor.Taxa.txt > YESlist |
| + | awk ‘{print $1}’ YESlist | ||
| + | </ | ||
| Copy the first column that was printed out by the above command and paste it into a file, let’s call this “yourlist”. | Copy the first column that was printed out by the above command and paste it into a file, let’s call this “yourlist”. | ||
| - | Step 13: Copy mvtaxatrimmedfas.py, | + | Step 13: Copy mvtaxatrimmedfas.py, |
| + | < | ||
| + | python taxon_deletion.py yourlist | ||
| + | </ | ||
| + | Note, this script only recognizes a list of the short names in a column format. This will make *taxatrimmed.fas files. Move all of them into a new folder. | ||
| Then do: | Then do: | ||
| - | * python mvtaxatrimmedfas.py. | + | < |
| + | python mvtaxatrimmedfas.py | ||
| + | </ | ||
| This will rename the taxatrimmed.fas files to .fas files. | This will rename the taxatrimmed.fas files to .fas files. | ||
| Next do: | Next do: | ||
| - | * python sepalignmask.py numberofthreads. | + | < |
| + | python sepalignmask.py numberofthreads | ||
| + | </ | ||
| This will re-align (using MAFFT) and trim (using BMGE) the sequences in these files, producing .faa and .bmge files. | This will re-align (using MAFFT) and trim (using BMGE) the sequences in these files, producing .faa and .bmge files. | ||
| You can also make shell scripts for all the .py scripts above and qsub them, especially for ones that take some time, such as the sepalignmask.py. | You can also make shell scripts for all the .py scripts above and qsub them, especially for ones that take some time, such as the sepalignmask.py. | ||
| - | Step 14: Move all the .bmge.fas files to a new folder. Copy alvert_septable.py and seqtools.py into this folder from the main START folder. Do python alvert_septable.py -c bmge.fas outputnamehere.dat. This will generate a gene table text and a log file along with a phylogenomic supermatrix outputnamehere.dat. Rename the log file before running IQTree as IQTree also generates an output with a log extension with a similar name. The log file lists the number of genes missing from each taxon. | + | Step 14: Move all the .bmge.fas files to a new folder. Copy alvert_septable.py and seqtools.py into this folder from the main START folder. Do: |
| + | < | ||
| + | python alvert_septable.py -c bmge.fas outputnamehere.dat | ||
| + | </ | ||
| + | This will generate a gene table text and a log file along with a phylogenomic supermatrix outputnamehere.dat. Rename the log file before running IQTree as IQTree also generates an output with a log extension with a similar name. The log file lists the number of genes missing from each taxon. | ||
| Step 15: Run IQTree on the outputnamehere.dat file. Make sure to rename your outputnamehere.dat.log file before doing this, as the shell will create a file with the same name. You may need to qsub this to a 256G or a higher RAM node, as it needs a lot of memory to be able to run. | Step 15: Run IQTree on the outputnamehere.dat file. Make sure to rename your outputnamehere.dat.log file before doing this, as the shell will create a file with the same name. You may need to qsub this to a 256G or a higher RAM node, as it needs a lot of memory to be able to run. | ||
| Here is an example of a shell script for this: | Here is an example of a shell script for this: | ||
| + | < | ||
| #!/bin/sh | #!/bin/sh | ||
| - | |||
| #$ -S /bin/sh | #$ -S /bin/sh | ||
| - | |||
| . / | . / | ||
| - | |||
| #$ -cwd | #$ -cwd | ||
| - | |||
| #$ | #$ | ||
| - | |||
| #$ -pe threaded 4 | #$ -pe threaded 4 | ||
| iqtree-omp -bb 1000 -wbt -m LG4X -s outputnamehere.dat -nt 4 | iqtree-omp -bb 1000 -wbt -m LG4X -s outputnamehere.dat -nt 4 | ||
| - | + | </ | |
| - | Step 16: Play around with your final tree! Reroot the tree - the common practice is to choose a point between the amorphea and the rest of the clades. Rename the short species names to their long format (Tommy wrote a script called | + | Step 16: Play around with your final tree! Reroot the tree - the common practice is to choose a point between the amorphea and the rest of the clades. Rename the short species names to their long format (Tommy wrote a script called |
| ---- | ---- | ||
| Line 175: | Line 220: | ||
| Bordor.sh | Bordor.sh | ||
| + | |||
| Author: Yana Eglit | Author: Yana Eglit | ||
| + | |||
| This submits the AddPipeline.py. | This submits the AddPipeline.py. | ||
| Line 190: | Line 237: | ||
| MakeShell.py | MakeShell.py | ||
| + | |||
| Author: Tommy Harding | Author: Tommy Harding | ||
| + | |||
| Makes shell scripts for tree-making. | Makes shell scripts for tree-making. | ||
| Line 199: | Line 248: | ||
| cpu_limit.sh | cpu_limit.sh | ||
| + | |||
| Author: Tommy Harding & Gordon Lax | Author: Tommy Harding & Gordon Lax | ||
| + | |||
| Uses Laura’s Perl script below to control how each tree shell scripts from a list of scripts are submitted to perun. | Uses Laura’s Perl script below to control how each tree shell scripts from a list of scripts are submitted to perun. | ||
| submit_cpulimit2.pl | submit_cpulimit2.pl | ||
| + | |||
| Author: Laura Eme | Author: Laura Eme | ||
| + | |||
| A Perl script to control the number of CPUs and waiting time for shell scripts to be submitted to perun. | A Perl script to control the number of CPUs and waiting time for shell scripts to be submitted to perun. | ||
| + | |||
| + | |||
| + | ---- | ||
| + | |||
| + | **Adding more hits to the pipeline:** | ||
| + | |||
| + | This pipeline only takes in the best hit from the local blastp step against the //A. thaliana// genes. If you're interested in looking at the positions of the other hits (the blastp step lists out 10 hits each gene), I have an edited version of the pipeline in / | ||
multi-gene_phylogeny_pipeline.1497617646.txt.gz · Last modified: by cgeb2001
