multi-gene_phylogeny_pipeline
Differences
This shows you the differences between two versions of the page.
| multi-gene_phylogeny_pipeline [2017/08/29 09:06] – cgeb2001 | multi-gene_phylogeny_pipeline [2018/03/10 11:07] (current) – 173.212.69.201 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | **Multi-gene phylogeny tree using Matt Brown’s Bordor dataset and pipeline** | + | ====== |
| Documentation by Kate Glennon, Sarah Shah, Shelby Williams, and Tommy Harding. | Documentation by Kate Glennon, Sarah Shah, Shelby Williams, and Tommy Harding. | ||
| - | The **Bordor** dataset is a set of 351 housekeeping genes that are well-conserved across all eukaryotes. This pipeline uses the gene sequences from // | + | The **Bordor** dataset is a set of 351 housekeeping genes that are well-conserved across all eukaryotes. This pipeline uses the gene sequences from // |
| + | |||
| + | All the original transcriptom/ | ||
| **The Pipeline Overview** | **The Pipeline Overview** | ||
| Line 102: | Line 105: | ||
| #$ -pe threaded 8 | #$ -pe threaded 8 | ||
| - | python AddPipeline3.0a.py <short name> < | + | python AddPipeline3.0a.py <short name> < |
| </ | </ | ||
| - | The number “1” refers to the standard genetic code. Use “NUC” if your FASTA file contains nucleotide sequences, or “AA” for protein sequences. Set the last flag on the line to “yes” if you want your alignments to be trimmed by BMGE. Edit the date attached to “END*” to match today’s date. Make sure the AddPipeline3.X.py in your “START*” folder matches the one in this shell script. As of now, AddPipeline3.0a.py is the latest version. Then qsub Bordor.sh | + | The number “1” refers to the standard genetic code. Use “NUC” if your FASTA file contains nucleotide sequences, or “AA” for protein sequences. Set the last flag on the line to “yes” if you want your alignments to be trimmed by BMGE. Edit the date attached to “END*” to match today’s date. Make sure the AddPipeline3.X.py in your “START*” folder matches the one in this shell script. As of now, AddPipeline3.0a.py is the latest version. **Make sure your "short name" and "long name" are correct, i.e. 8 characters for the former, and the latter must have one " |
| NOTE: If you need to add sequences from several taxa: in step 1, instead of renaming the “END*” folder “START*”, | NOTE: If you need to add sequences from several taxa: in step 1, instead of renaming the “END*” folder “START*”, | ||
| Line 128: | Line 131: | ||
| </ | </ | ||
| This will sequentially add the appropriate sequences for all the organisms of interest to the Bordor dataset. Trimming will not occur until the last taxon is added. | This will sequentially add the appropriate sequences for all the organisms of interest to the Bordor dataset. Trimming will not occur until the last taxon is added. | ||
| - | + | ||
| + | NOTE2: If you have alignment files from someone else, and you want to add your own transcriptomes to them, move the alignment files into the folder " | ||
| Step 4: If everything went as expected, there will be a folder named “bmge_trimmed_old” in the “END*” folder. Download a bunch of *.faa (aligned non-trimmed sequences) and *.bmge.fas (trimmed aligned sequences) files to your computer and examine them with a sequence viewer such as AliView. The last line(s) is the sequence from your transcriptome/ | Step 4: If everything went as expected, there will be a folder named “bmge_trimmed_old” in the “END*” folder. Download a bunch of *.faa (aligned non-trimmed sequences) and *.bmge.fas (trimmed aligned sequences) files to your computer and examine them with a sequence viewer such as AliView. The last line(s) is the sequence from your transcriptome/ | ||
| Line 170: | Line 175: | ||
| Copy the first column printed by the above command and paste it into a file; let’s call it “yourlist”. | Copy the first column printed by the above command and paste it into a file; let’s call it “yourlist”. | ||
| - | Step 13: Copy mvtaxatrimmedfas.py, | + | Step 13: Copy mvtaxatrimmedfas.py, |
| + | < | ||
| + | python taxon_deletion.py yourlist | ||
| + | </ | ||
| + | Note: this script only recognizes a list of short names in a single column. It will create *taxatrimmed.fas files; move all of them into a new folder. | ||
| Then do: | Then do: | ||
| Line 186: | Line 195: | ||
| You can also make shell scripts for all the .py scripts above and qsub them, especially for ones that take some time, such as the sepalignmask.py. | You can also make shell scripts for all the .py scripts above and qsub them, especially for ones that take some time, such as the sepalignmask.py. | ||
| - | Step 14: Move all the .bmge.fas files to a new folder. Copy alvert_septable.py and seqtools.py into this folder from the main START folder. Do python alvert_septable.py -c bmge.fas outputnamehere.dat. This will generate a gene table text and a log file along with a phylogenomic supermatrix outputnamehere.dat. Rename the log file before running IQTree as IQTree also generates an output with a log extension with a similar name. The log file lists the number of genes missing from each taxon. | + | Step 14: Move all the .bmge.fas files to a new folder. Copy alvert_septable.py and seqtools.py into this folder from the main START folder. Do: |
| + | < | ||
| + | python alvert_septable.py -c bmge.fas outputnamehere.dat | ||
| + | </ | ||
| + | This will generate a gene table text and a log file along with a phylogenomic supermatrix outputnamehere.dat. Rename the log file before running IQTree as IQTree also generates an output with a log extension with a similar name. The log file lists the number of genes missing from each taxon. | ||
| Step 15: Run IQTree on the outputnamehere.dat file. Make sure to rename your outputnamehere.dat.log file before doing this, as IQTree will create a file with the same name. You may need to qsub this to a node with 256G of RAM or more, as it needs a lot of memory to run. | Step 15: Run IQTree on the outputnamehere.dat file. Make sure to rename your outputnamehere.dat.log file before doing this, as IQTree will create a file with the same name. You may need to qsub this to a node with 256G of RAM or more, as it needs a lot of memory to run. | ||
| Line 200: | Line 213: | ||
| iqtree-omp -bb 1000 -wbt -m LG4X -s outputnamehere.dat -nt 4 | iqtree-omp -bb 1000 -wbt -m LG4X -s outputnamehere.dat -nt 4 | ||
| </ | </ | ||
| - | Step 16: Play around with your final tree! Reroot the tree - the common practice is to choose a point between the Amorphea and the rest of the clades. Rename the short species names to their long format (Tommy wrote a script called | + | Step 16: Play around with your final tree! Reroot the tree - the common practice is to choose a point between the Amorphea and the rest of the clades. Rename the short species names to their long format (Tommy wrote a script called |
| ---- | ---- | ||
| Line 207: | Line 220: | ||
| Bordor.sh | Bordor.sh | ||
| + | |||
| Author: Yana Eglit | Author: Yana Eglit | ||
| + | |||
| This submits AddPipeline.py. | This submits AddPipeline.py. | ||
| Line 222: | Line 237: | ||
| MakeShell.py | MakeShell.py | ||
| + | |||
| Author: Tommy Harding | Author: Tommy Harding | ||
| + | |||
| Makes shell scripts for tree-making. | Makes shell scripts for tree-making. | ||
| Line 231: | Line 248: | ||
| cpu_limit.sh | cpu_limit.sh | ||
| + | |||
| Author: Tommy Harding & Gordon Lax | Author: Tommy Harding & Gordon Lax | ||
| + | |||
| Uses Laura’s Perl script below to control how the tree shell scripts in a list are submitted to perun. | Uses Laura’s Perl script below to control how the tree shell scripts in a list are submitted to perun. | ||
| submit_cpulimit2.pl | submit_cpulimit2.pl | ||
| + | |||
| Author: Laura Eme | Author: Laura Eme | ||
| + | |||
| A Perl script to control the number of CPUs and waiting time for shell scripts to be submitted to perun. | A Perl script to control the number of CPUs and waiting time for shell scripts to be submitted to perun. | ||
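
Two of the steps above are easy to get wrong by hand: the “yourlist” file must hold short names in a single column (step 13), and the outputnamehere.dat.log file must be renamed before IQTree runs (steps 14 and 15). The following is only a minimal sketch of pre-flight checks for both, not part of the pipeline itself. The function names, the ".septable" suffix, and the idea of enforcing the 8-character short-name length on “yourlist” are assumptions made for illustration; the pipeline scripts may behave differently.

```python
from pathlib import Path

# Assumption: short names are 8 characters long, as stated in the step 3 note.
SHORT_NAME_LENGTH = 8


def check_short_name_list(path):
    """Return (line_number, text, problem) for every line that is not
    a single short name of the expected length. Blank lines are skipped."""
    problems = []
    lines = Path(path).read_text().splitlines()
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        if len(text.split()) != 1:
            problems.append((number, text, "more than one column"))
        elif len(text) != SHORT_NAME_LENGTH:
            problems.append((number, text, "unexpected name length"))
    return problems


def preserve_log(log_path, suffix=".septable"):
    """Rename a log file so a later IQTree run cannot overwrite it.

    Returns the new path, or None if the log does not exist.
    The suffix is a hypothetical choice; any unused name works.
    """
    log = Path(log_path)
    if not log.exists():
        return None
    target = log.with_name(log.name + suffix)
    if target.exists():
        # Refuse to clobber an earlier backup.
        raise FileExistsError(f"{target} already exists")
    log.rename(target)
    return target
```

For example, calling check_short_name_list("yourlist") before step 13 and preserve_log("outputnamehere.dat.log") before step 15 would flag malformed lines and move the log out of IQTree's way. Returning a list of problems rather than raising lets you see every bad line at once.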
multi-gene_phylogeny_pipeline.1504008391.txt.gz · Last modified: by cgeb2001
