phylogeny_protocol2

Differences

This shows you the differences between two versions of the page.

phylogeny_protocol2 [2021/10/10 14:50] 134.190.232.9
phylogeny_protocol2 [2022/02/07 15:21] (current) 134.190.232.106
Line 1: Line 1:
 +The GitHub resource for this protocol: https://github.com/zx0223winner/TreeTuner
 +
 **Background**
  
Line 318: Line 320:
 * For NCBI-NR, the translated database header looks like this:
  
 +<code>
 +>WP_048801694.1 ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
 +MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
 +</code>
 +
 +<code>
 +# Run the renaming script (link upcoming soon):
 +python3 rename_ncbi_blastdb.py
 +# The input files it uses are:
 +        fastaFile = '/db1/nr-nt-fasta-oct-2020/nr'
 +        taxidFile = '/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021/acc2tax_prot_all.txt'
 +</code>
 +
 +Then the desired output will be:
 +
 +<code>
 +>Leuconostoc_citreum@NCBI_WP_048801694.1_Bacteria_Terrabacteria_group_Firmicutes_Bacilli_Lactobacillales_Lactobacillaceae_Leuconostoc_Leuconostoc_citreum_33964
 +MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQ
 +</code>
 +
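Since rename_ncbi_blastdb.py is not linked yet, here is a minimal Python sketch of the header transformation shown above. It is not the actual script: the function name `rename_header` is hypothetical, and it assumes the lineage arrives as a semicolon-separated string (as in acc2tax-style output) together with the taxid.

```python
import re


def rename_header(header, lineage, taxid):
    """Build '>Species@NCBI_<accession>_<lineage>_<taxid>' from an NR-style header.

    Hypothetical helper: 'lineage' is assumed to be a ';'-separated string.
    """
    # The accession is the first whitespace-delimited token after '>'.
    accession = header.lstrip(">").split()[0]
    # The species is taken from the first [bracketed] organism field.
    match = re.search(r"\[([^\]]+)\]", header)
    if match is None:
        raise ValueError(f"no [organism] field in header: {header!r}")
    species = match.group(1).replace(" ", "_")
    # Spaces inside rank names become underscores, ranks are joined by '_'.
    ranks = [r.strip().replace(" ", "_") for r in lineage.split(";") if r.strip()]
    return f">{species}@NCBI_{accession}_{'_'.join(ranks)}_{taxid}"
```

Calling it with the NR header above, the lineage "Bacteria; Terrabacteria group; Firmicutes; Bacilli; Lactobacillales; Lactobacillaceae; Leuconostoc; Leuconostoc citreum" and taxid 33964 reproduces the desired output header.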
 +Directory of the renamed MMETSP database: /misc/scratch2/###/###/mmetsp
 +
 +With the two renamed databases available, you can merge them with 'cat', build the new merged database with 'makeblastdb', and then run BLAST against it again. 8-)
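The merge-and-rebuild step can be sketched in Python as follows. This is only an illustration under assumptions: the helper names `merge_fasta` and `makeblastdb_command` are hypothetical, the database is assumed to be protein (`-dbtype prot`), and the command is only assembled here, not executed.

```python
def merge_fasta(inputs, merged_path):
    """Concatenate FASTA files like `cat a b > merged`.

    Unlike plain cat, a newline is added when a file does not end with one,
    so the next '>' header always starts on its own line.
    """
    with open(merged_path, "wb") as out:
        for path in inputs:
            last = b"\n"
            with open(path, "rb") as src:
                # Copy in 1 MiB chunks so very large databases fit in memory.
                while chunk := src.read(1 << 20):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")
    return merged_path


def makeblastdb_command(fasta_path, db_name):
    """Return the makeblastdb call as an argument list (assumes a protein db)."""
    return ["makeblastdb", "-in", str(fasta_path),
            "-dbtype", "prot", "-out", str(db_name)]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where BLAST+ is installed.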
 +
 +**3. Minimizing the redundancy and complexity of large phylogenetic datasets**
 +
 +Finally, having covered the two methods above, we can return to the topic raised at the very beginning: coarse- and fine-tuning large phylogenetic datasets by reducing their redundancy and complexity.
 +
 +1. **Coarse-tuning**: Let's start with the simpler of the two, coarse-tuning with Treetrimmer (Maruyama et al. 2013).
 +
 +<code>
 +ruby treetrimmer.rb sample/####_aligned_trimmed.newick sample/###_parameter_input.in sample/taxonomic_info.txt > ###_treetrimmer.newick
 +</code>
 +
 +The "###.newick" and "###_parameter_input.in" files are easy to prepare. The taxonomic_info.txt file, however, needs to be reformatted as follows:
 +
 +<code>
 +taxonomic_info.txt
 +NP_563657 Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis thaliana
 +XP_002889406 Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis lyrata; Arabidopsis lyrata subsp. lyrata
 +</code>
 +
 +The taxonomic_info.txt file can be created with the acc2tax program. Read more here: http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
 +
 +__Note: acc2tax needs the gene ID without its version suffix (e.g. NP_563657), and the same applies to NCBI IDs.__ For usage of the program, see: http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery; http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
 +
 +<code>
 +>WP_048801694.1 ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
 +MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
 +</code>
 +
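A header like the one above carries a versioned accession (WP_048801694.1), while acc2tax expects it without the version. A minimal Python sketch of stripping the suffix follows; the function names are hypothetical and it assumes the version is a trailing ".digits".

```python
import re

# A trailing ".<digits>" is treated as the version suffix (assumption).
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(accession):
    """Return the accession without its version, e.g. NP_563657.1 -> NP_563657."""
    return _VERSION_SUFFIX.sub("", accession)


def accession_from_header(header):
    """Take the first token of a FASTA header and drop its version suffix."""
    return strip_version(header.lstrip(">").split()[0])
```

Accessions that already lack a version pass through unchanged, so the function is safe to apply to a mixed list.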
 +With the taxonomic_info.txt ready, you can get the tree file and another taxa file:
 +<code>
 +XP_026407875 Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; Ranunculales; Papaveraceae; Papaveroideae; Papaver; Papaver somniferum 2 4
 +XP_034682772 Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; rosids incertae sedis; Vitales; Vitaceae; Viteae; Vitis; Vitis riparia 2
 +</code>
 +
 +This tree gives a rough estimate of the tree's diversity.
 +
 +
 +2. **Fine-tuning**: using scripts written in Perl by Laura Eme (2012-14)
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +. /etc/profile
 +#$ -cwd
 +#$ -o logfile
 +#$ -pe threaded 20
 +#export PATH=/scratch2/software/anaconda/bin:$PATH
 +
 +# $1 is a text file listing one file prefix per line
 +while read -r line
 +do
 +    # 1. Align the sequences with MAFFT
 +    mafft --auto --thread 20 "/misc/scratch2/####/$line.fasta" > "/misc/scratch2/####/aligned/$line.aligned.fasta"
 +
 +    # 2. Trim the alignment with BMGE (amino acids, BLOSUM30)
 +    /scratch2/software/anaconda/envs/bmge/bin/bmge -i "/misc/scratch2/####/aligned/$line.aligned.fasta" -t AA -m BLOSUM30 -of "/misc/scratch2/xizhang/####/trimmed/$line.aligned.trimmed.fasta"
 +
 +    # 3. Build a tree with FastTree (the input path must match the BMGE -of path above)
 +    FastTree "/misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta" > "/misc/scratch2/####/fasttree/$line.aligned.trimmed.newick"
 +done < "$1"
 +</code>
 +
 +After the MAFFT, BMGE, and FastTree steps, you have the trimmed alignment and the Newick tree. Now let's use the Perl scripts to prune the leaves or trim the branches.
 +
 +<code>
 +# These are the files you will need (links upcoming soon):
 +# rm_inparal_rank.pl taxa_rank.txt
 +# taxa_not_remove.txt trim2untrim.pl Instructions.txt lauralib.pm
 +
 +>perl rm_imparalogs <tree file> <alignment file> <distance cutoff> <taxa not to remove> <taxa rank>
 +# Removes sister sequences from the same rank. Taxa in the list "taxa not to remove" are ignored.
 +</code>
 +
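To make the idea in the usage comment concrete, here is a toy Python sketch of sister-sequence pruning. It is not the algorithm of the Perl script: the nested-tuple tree format, the function name `prune_sisters`, measuring distance as the sum of the two sister branch lengths, and dropping the longer-branched sister are all assumptions made for illustration.

```python
def prune_sisters(node, rank_of, cutoff, protected=frozenset()):
    """Collapse sister leaves that share a rank and lie within `cutoff`.

    A leaf is (name, branch_length); an internal node is
    ([child, child, ...], branch_length). Returns (new_node, removed_names).
    """
    label, length = node
    if isinstance(label, str):  # leaf: nothing to prune
        return node, []
    removed, children = [], []
    for child in label:
        pruned, gone = prune_sisters(child, rank_of, cutoff, protected)
        children.append(pruned)
        removed.extend(gone)
    # Only a pair of sister leaves (a "cherry") is a candidate.
    if len(children) == 2 and all(isinstance(c[0], str) for c in children):
        (a, la), (b, lb) = children
        # Leaves without a known rank are never pruned.
        same_rank = rank_of.get(a) is not None and rank_of.get(a) == rank_of.get(b)
        if same_rank and la + lb <= cutoff:
            drop, keep = ((a, la), (b, lb)) if la >= lb else ((b, lb), (a, la))
            if drop[0] in protected:  # never remove a protected taxon
                drop, keep = keep, drop
            if drop[0] not in protected:
                removed.append(drop[0])
                # The survivor replaces the cherry and inherits its stem length.
                return (keep[0], keep[1] + length), removed
    return (children, length), removed


# Example: two Arabidopsis sisters and one Vitis sequence.
tree = ([([("A1", 0.01), ("A2", 0.02)], 0.1), ("B1", 0.3)], 0.0)
ranks = {"A1": "Arabidopsis", "A2": "Arabidopsis", "B1": "Vitis"}
pruned_tree, removed_names = prune_sisters(tree, ranks, cutoff=0.05)
```

Because the recursion works bottom-up, a collapsed pair can itself become a sister of a neighbouring leaf and be considered again further up the tree.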
 +This yields the files "###.removedSeq" and "###.fasttree".
 +
 +<code>
 +> perl trim2untrim.pl [trimmed alignment] [untrimmed alignment]
 +# Removes sequences from the untrimmed alignment based on the sequences present in the trimmed alignment
 +</code>
 +
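The filtering that trim2untrim.pl is described as doing can be sketched in Python as below. This is an illustration, not the Perl script: the function names are hypothetical, and it assumes that "based on sequences present in the trimmed alignment" means keeping only the untrimmed sequences whose IDs still appear in the trimmed alignment.

```python
def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}, keeping input order."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first token of the header line.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def filter_untrimmed(trimmed_text, untrimmed_text):
    """Keep only untrimmed sequences whose IDs occur in the trimmed alignment."""
    keep = set(read_fasta(trimmed_text))
    return {n: s for n, s in read_fasta(untrimmed_text).items() if n in keep}
```

The result holds full-length aligned sequences for the reduced taxon set, which can then be written back out as FASTA for the next analysis step.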
 +With the reduced alignment, you can run a more rigorous downstream analysis with IQ-TREE.
 +
 +Note: not every gene's species has taxonomy information. This has nothing to do with updates to the NCBI taxonomy.
  
 +The '0' in the gene name 'CP_0177652116_0_Stygamoeba_regulata_BSH-02190019' is not an NCBI taxid.
  
  
 -<Last updated by Xi Zhang on Oct 6th,2021> upcoming
 +<Last updated by Xi Zhang on Oct 6th, 2021>
phylogeny_protocol2.1633888224.txt.gz · Last modified: by 134.190.232.9