multi-gene_phylogeny_pipeline

Differences

This shows you the differences between two versions of the page.

multi-gene_phylogeny_pipeline [2017/10/31 09:31] 129.173.88.84
multi-gene_phylogeny_pipeline [2018/03/10 11:07] (current) 173.212.69.201
Line 1: Line 1:
-**Multi-gene phylogeny tree using Matt Brown’s Bordor dataset and pipeline**
+====== Multi-gene phylogeny tree using Matt Brown’s Bordor dataset and pipeline ======
    
 Documentation by Kate Glennon, Sarah Shah, Shelby Williams, and Tommy Harding.
    
-The **Bordor** dataset is a set of 351 housekeeping genes that are well-conserved across all eukaryotes. This pipeline uses the gene sequences from //Arabidopsis thaliana// as queries to fish for homologues during the BLAST step.
+The **Bordor** dataset is a set of 351 housekeeping genes that are well-conserved across all eukaryotes. This pipeline uses the gene sequences from //Arabidopsis thaliana// as queries to fish for homologues during the BLAST step. Credit goes to Matt Brown & co.: [[https://doi.org/10.1093/gbe/evy014]], [[https://doi.org/10.1093/molbev/msx162]]
 + 
+All the original transcriptome/proteome files that are in the Bordor alignment are in **/scratch2/mbrown/PhylogenomicDatabases**
    
 **The Pipeline Overview**
Line 102: Line 105:
 #$ -pe threaded 8
    
-python AddPipeline3.0a.py <short name> <complete species name with “_” in between the genus and species: Genus_species> 1 <AA or NUC> Bordor.351.refdat.txt ./ END.YY-MM-DD yes
+python AddPipeline3.0a.py <short name> <complete species name with only one “_” in between the genus and species: Genus_species> 1 <AA or NUC> Bordor.351.refdat.txt ./ END.YY-MM-DD yes
 </code>
-The number “1” refers to the standard genetic code. Use “NUC” if your fasta file contains nucleotide sequences, or change it to “AA” for protein sequences. Say “yes” for the last flag at the end of the line if you want your alignments to be trimmed by bmge. Edit the date attached to “END*” to match today’s date. Make sure the AddPipeline3.X.py in your “START*” folder matches the one in this shell script. As of now, AddPipeline3.0a.py is the latest version. Then qsub Bordor.sh
+The number “1” refers to the standard genetic code. Use “NUC” if your fasta file contains nucleotide sequences, or change it to “AA” for protein sequences. Say “yes” for the last flag at the end of the line if you want your alignments to be trimmed by bmge. Edit the date attached to “END*” to match today’s date. Make sure the AddPipeline3.X.py in your “START*” folder matches the one in this shell script. As of now, AddPipeline3.0a.py is the latest version. **Make sure your "short name" and "long name" are correct, i.e. 8 characters for the former, and the latter must have exactly one "_"**. Then qsub Bordor.sh. Ensure that you are using a node with enough CPUs available; otherwise your Bordor.sh error file will show that the thread count is out of range.
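The naming rules above can be checked before submitting the job. The following is a minimal sketch, not part of the pipeline: the function name `check_names` and the example short name `Athalian` are hypothetical, and it assumes that "8 characters" means exactly 8.

```python
def check_names(short_name, long_name):
    """Return a list of problems; an empty list means both names look fine."""
    problems = []

    # Assumption: the short name must be exactly 8 characters long.
    if len(short_name) != 8:
        problems.append(
            f"short name {short_name!r} has {len(short_name)} characters, expected 8"
        )

    # The long name must contain exactly one "_" (Genus_species).
    if long_name.count("_") != 1:
        problems.append(
            f"long name {long_name!r} has {long_name.count('_')} underscores, expected 1"
        )
    else:
        genus, species = long_name.split("_")
        if not genus or not species:
            problems.append(f"long name {long_name!r} is missing the genus or the species")

    return problems


if __name__ == "__main__":
    # Hypothetical example names, for illustration only.
    for problem in check_names("Athalian", "Arabidopsis_thaliana"):
        print(problem)
```

An empty result means the two names satisfy both rules; anything else is worth fixing before running qsub, since a failed job only reports the problem in its error file.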
    
 NOTE: If you need to add sequences from several taxa: in step 1, instead of renaming the “END*” folder “START*”, rename it with the short name of the first organism you want to add; and in step 2, copy original fasta files for all the organisms of interest (renamed with the organism short names as instructed in step 2 above) upstream of the folder created in step 1. Save the Bordor.sh script at the same location and edit it as follows (using 3 organisms as an example):
Line 128: Line 131:
 </code>
 This will sequentially add the appropriate sequences for all the organisms of interest to the Bordor dataset. Trimming will not occur until the last taxon is added.
- + 
+NOTE2: If you have alignment files from someone else, and you want to add your own transcriptomes to them, move the alignment files into the "old_aln" folder in your START folder.
 Step 4: If everything went as expected, there will be a folder named “bmge_trimmed_old” in the “END*” folder. Download a bunch of *.faa (aligned non-trimmed sequences) and *.bmge.fas (trimmed aligned sequences) files to your computer and examine them with a sequence viewer such as AliView. The last line(s) contain the sequence from your transcriptome/protein data that was aligned to the other sequences of that particular gene. Make sure they look aligned; for instance, if all other sequences have a “GGG” in a specific location then you should expect your sequence to have the same. The .bmge.fas files are .faa files in which the badly-aligned positions were trimmed away.
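Before opening the files in a viewer, a quick sanity check is that every sequence in an alignment has the same length, including the last one (your own). The following is a minimal sketch and not part of the pipeline: the function names are hypothetical, and it assumes plain FASTA input in which sequences may wrap over several lines. It does not replace the visual check in AliView.

```python
def parse_fasta(lines):
    """Parse FASTA text (an iterable of lines) into a list of (name, sequence)."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_last_sequence(records):
    """Return (name of the last record, True if all sequences have equal length)."""
    lengths = {len(seq) for _, seq in records}
    return records[-1][0], len(lengths) == 1


if __name__ == "__main__":
    # Tiny made-up alignment, for illustration only.
    example = [">other\n", "AC-G\n", "TT\n", ">mine\n", "ACGGTT\n"]
    print(check_last_sequence(parse_fasta(example)))
```

To run it on a real file, pass an open file handle to `parse_fasta`, for example `parse_fasta(open("some_gene.faa"))`, where the file name is a placeholder.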
    
Line 208: Line 213:
 iqtree-omp -bb 1000 -wbt -m LG4X -s outputnamehere.dat -nt 4
 </code>
-Step 16: Play around with your final tree! Reroot the tree - the common practice is to choose a point between the amorphea and the rest of the clades. Rename the short species names to their long format (Tommy wrote a script called Rename.py. It is located in /scratch2/sarahshah/START.24.05.2017ss/fas_only_badremoved/taxatrimmedfas/bmge), export the tree as a PDF and then edit it using Adobe Illustrator.
+Step 16: Play around with your final tree! Reroot the tree - the common practice is to choose a point between the Amorphea and the rest of the clades. Rename the short species names to their long format (Tommy wrote a script called RenameTree.py. It is located in /home/sarahshah/AdditionalScripts/Multigene_Phylogeny), export the tree as a PDF and then edit it using Adobe Illustrator.
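To illustrate the renaming idea only, here is a minimal sketch of swapping short tip names for long ones in a Newick string. It is not RenameTree.py and makes no claim about how that script works. The function name, the mapping, and the example tree are all hypothetical, and it assumes the Newick string has no whitespace or quoted labels.

```python
import re

# A tip label directly follows "(" or "," and ends before ":", ",", ")" or ";".
# Support values follow ")" and are therefore left untouched.
TIP_LABEL = re.compile(r"(?<=[(,])([^(),:;]+)")


def rename_tips(newick, names):
    """Replace each tip label found in `names` with its long form."""
    return TIP_LABEL.sub(lambda m: names.get(m.group(1), m.group(1)), newick)


if __name__ == "__main__":
    # Hypothetical short-to-long mapping and tree, for illustration only.
    names = {
        "Athalian": "Arabidopsis_thaliana",
        "Hsapiens": "Homo_sapiens",
        "Scerevis": "Saccharomyces_cerevisiae",
    }
    tree = "((Athalian:0.1,Hsapiens:0.2)100:0.3,Scerevis:0.4);"
    print(rename_tips(tree, names))
```

Matching whole labels rather than doing a plain text replace avoids renaming a short name that happens to be a substring of another label.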
  
 ----
Line 215: Line 220:
    
 Bordor.sh
+
 Author: Yana Eglit
+
 This submits the AddPipeline.py.
    
Line 230: Line 237:
    
 MakeShell.py
+
 Author: Tommy Harding
+
 Makes shell scripts for tree-making.
    
Line 239: Line 248:
    
 cpu_limit.sh
+
 Author: Tommy Harding & Gordon Lax
+
 Uses Laura’s Perl script below to control how the tree shell scripts in a list of scripts are submitted to perun.
 
 submit_cpulimit2.pl
+
 Author: Laura Eme
+
 A Perl script to control the number of CPUs and waiting time for shell scripts to be submitted to perun.
  