Differences

This shows you the differences between two versions of the page.

--- phylogeny_protocol2 [2021/10/10 12:56] – 134.190.232.9
+++ phylogeny_protocol2 [2022/02/07 15:21] (current) – 134.190.232.106
@@ Line 1: / Line 1: @@
+The GitHub resource for this protocol: https://github.com/zx0223winner/TreeTuner
 **Background**
@@ Line 233: / Line 235: @@
 Finally, (～￣▽￣)～ we have the desired BLAST hits headers from both MMETSP and NCBI-nr containing the hierarchical taxonomic terms. You can then merge the two files to play around what you are most familiar with seq aligning, seq trimming, and tree building. please read more http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol
-But after all of these, you will need the step to color your newick tree. Please find this script color_newick_tree.py via
+But after all of these, you will need the step to color your newick tree. Please find this script color_newick_tree.py via http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol4
 **2. Method Two: blasting after renaming the header**
-Still remember we mentioned earlier due to the different naming strategy, we have to decide to blast before renaming the header or after?
+Still remember we mentioned earlier due to the different naming strategy, we have to decide to blast before renaming the header or after? If you have read the Method One, you might feel it is from simple to complex. However, the Method TWO is from complex to simple, because it will take much longer than you thought to prepare the input files. But once you done, the rest of things could be much easier. Ok, let me stop eating around the corner. Our goal is to actually recreate the NCBI and MMETSP database headers like this:
+<code>
+>Compsopogon_coeruleus@CP_0184679990_Eukaryota_Rhodophyta_Compsopogonophyceae_Compsopogonales_Compsopogonaceae_Compsopogon_Compsopogon_caeruleus_31354_Compsopogon_coeruleus
+XNKTVGEKEKVDVGKKGGGGEEREMVGFVSDVFISLNLEWSRVGVGVVNSRGKRKVYAVGEFPGSSPGRTSVLVPQKEKVQKESKEKKRSHGGGKYKVLILNDAFNSMEYVAATLLRLIPGMTTELAWKVMKEAHENGAAVVGVWVFELAEAYCDAIQSAGIGSRIEPE
+>Tarenaya_hassleriana@NCBI_XP_010551290.1_Eukaryota_Viridiplantae_Streptophyta_Streptophytina_Embryophyta_Tracheophyta_Euphyllophyta_Spermatophyta_Magnoliopsida_Mesangiospermae_eudicotyledons_Gunneridae_Pentapetalae_rosids_malvids_Brassicales_Cleomaceae_New_World_clade_Tarenaya_Tarenaya_hassleriana_28532
+MESAICGRLALSPSTVFNSKPGEKHSLYKGPCGNHGFVMSLCASAVGKGGGLLDKPVIEKTTPGRESEFDLRKSRKMAPPYRVILHNDNFNRREYVVQVLMKVIPGMTLDNAVNIMQEAHHNGLAVVIICAQADAEEHCMQLRGNGLLSSIEPASGGGC
+</code>
+Then use makeblastdb command to make the database compiled files. Considering the size of NCBI ~100Gb and the MMETSP (~10GB), I have not really tested myself. But i assume it might take at least three days running. Here, I will simply provide the method for you to feel free to use.
+* As for MMETSP, the translated database header is like this:
+<code>
+>CP_0113232432_97485_Prymnesium_parvum
+</code>
+To pull out the taxa information, use the new python script rename_mmetsp_blastdb.py (links upcoming soon)
+<code>
+python3 rename_mmetsp_blastdb.py
+</code>
+Error1:
+<code>
+#Note: if not python v3, it will be error
+ImportError: No module named ete3
+</code>
+Error2:
+<code>
+from PyQt5 import QtGui, QtCore
+RuntimeError: the PyQt5.QtCore and PyQt4.QtCore modules both wrap the QObject class
+</code>
+To solve above error:use python3
+<code>
+source activate Unicycler-python3
+pip install six
+</code>
+Fist time running the script on MacOS, it might generate an error. (https://stackoverflow.com/questions/50236117/scraping-ssl-certificate-verify-failed-error-for-http-en-wikipedia-org) This will need you to allow the Macintosh HD > Applications > Python3.8 > double click on "Install Certificates.command"
+<code>
+####@TE809 ~ % /Applications/Python\ 3.9/Install\ Certificates.command ; exit;
+ -- pip install --upgrade certifi
+Requirement already satisfied: certifi in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (2021.10.8)
+ -- removing any existing file or link
+ -- creating symlink to certifi certificate bundle
+ -- setting permissions
+ -- update complete
+Saving session...
+...copying shared history...
+...saving history...truncating history files...
+...completed.
+</code>
+<code>
+#running
+/misc/scratch2/###/arabidopsis/CAM_MMETSP@perun3> python3 rename_mmetsp_blastdb.py
+NCBI database not present yet (first time used?)
+Updating taxdump.tar.gz from NCBI FTP site (via HTTP)...
+Done. Parsing...
+Loading node names...
+2369147 names loaded.
+synonyms loaded.
+Loading nodes...
+</code>
+Then the latest taxdump.tar.gz will be downloaded via ETE3 package. The output file will be like this:
+<code>
+>Prymnesium_parvum@CP_0113232432_Eukaryota_Haptista_Haptophyta_Prymnesiophyceae_Prymnesiales_Prymnesiaceae_Prymnesium_Prymnesium_parvum_97485_Prymnesium_parvum
+XLRCLTRTPSLPSRLLATATPSRACPALSSALHRXASSAAFLRPSASASSCPSRCLSSTSRAPGASGSTQRAIPSXGGANGGWVNPLARPKGESLKKYGTDLNELARAGRLDPVIGRDEEIRRMVQVLSRRRKNNPVLIGEPGVGKTAIVEGLAQRIVDKEVPDSMRDARVIALDVGALVAGAKYRGEFEXRLKAVLADVSEAAGDVILFIDELHTVIGAGAADGAMDASNLLKPQLARGELSCVGATTLX
+>Prymnesium_parvum@CP_0113233658_Eukaryota_Haptista_Haptophyta_Prymnesiophyceae_Prymnesiales_Prymnesiaceae_Prymnesium_Prymnesium_parvum_97485_Prymnesium_parvum
+VASRXCEADDXAAAEGTRAVAMLPRLAIYLFAPLASASLVQLPQWPQRRLSPAGRLGLRPLPAAPRGSGQVQMVFDRFDRDAMRLVMDAQVEARKLGGSAVGTEHLLLAGTMQADAIQQALDRAGVKASGVRDAIRGPGGGSIPSLDGLFGLKAKDELLP
+</code>
+* As for NCBI-NR, the translated database header is like this:
+<code>
+>WP_048801694.1	ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
+</code>
+<code>
+#So simply run
+python3 rename_ncbi_blastdb.py (link upcoming soon).
+#The input file will include:
+        fastaFile = '/db1/nr-nt-fasta-oct-2020/nr'
+        taxidFile = '/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021/acc2tax_prot_all.txt'
+</code>
+Then the desired output will be:
+<code>
+>Leuconostoc_citreum@NCBI_WP_048801694.1_Bacteria_Terrabacteria_group_Firmicutes_Bacilli_Lactobacillales_Lactobacillaceae_Leuconostoc_Leuconostoc_citreum_33964
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQ
+</code>
+Directory to renamed MMETSP: /misc/scratch2/###/###/mmetsp
+Then with the two renamed database available, you could merge then by 'cat'. Then build the new merged database via 'makeblastdb'. Then Blast them again.8-)
+**3.Minimizing the redundancy and complexity of large phylogenetic datasets**
+Finally, after using two different methods, we can touch on the topic we raised up at very beginning. Coarse and fine-tuning large phylogenetic datasets via reducing the redundancy and complexity.
+. **Coarse-tuning**: Let's start with the relatively simple one coarse-tuning via Treetrimmer (Maruyama et.al 2013)
+<code>
+ruby treetrimmer.rb sample/####_aligned_trimmed.newick sample/###_parameter_input.in sample/taxonomic_info.txt > ###_treetrimmer.newick
+</code>
+The "##..newick" and "###input.in" files can easily be prepared. The taxonomic_info.txt;however need to reformatted.
+<code>
+taxonomic_info.txt
+NP_563657	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis thaliana
+XP_002889406	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis lyrata; Arabidopsis lyrata subsp. lyrata
+</code>
+The taxonomic_info.txt can be created by acc2tax program. please read more from here:http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
+__Note: The acc2tax need the gene ID without version (e.g.NP_563657), so as the NCBI ID.__ Please find the usage of the program: http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery; http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
+<code>
+>WP_048801694.1	ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
+</code>
+With the taxonomic_info.txt ready, you can get the tree file and another taxa file:
+<code>
+XP_026407875	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; Ranunculales; Papaveraceae; Papaveroideae; Papaver; Papaver somniferum	2	4
+XP_034682772	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; rosids incertae sedis; Vitales; Vitaceae; Viteae; Vitis; Vitis riparia	2
+</code>
+This tree give a rough tree diversity estimation.
+. **Fine-tuning**  Laura Eme (2012-14) written in Perl
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -o logfile
+#$ -pe threaded 20
+#export PATH=/scratch2/software/anaconda/bin:$PATH
+while read line
+do
+mafft --auto --thread 20 /misc/scratch2/####/$line.fasta >/misc/scratch2/####/aligned/$line.aligned.fasta
+/scratch2/software/anaconda/envs/bmge/bin/bmge -i /misc/scratch2/####/aligned/$line.aligned.fasta -t AA -m BLOSUM30 -of /misc/scratch2/xizhang/####/trimmed/$line.aligned.trimmed.fasta
+FastTree /misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta > /misc/scratch2/####/fasttree/$line.aligned.trimmed.newick
+done <$1
+</code>
+let's say after the mafft,bmge,fasttree steps. You have the trimmed alignment and new wick tree. Now let's use the perl script to prune the leaves or trim the branches.
+<code>
+# These are files you will need. (links upcoming soon)
+# rm_inparal_rank.pl	taxa_rank.txt
+# taxa_not_remove.txt	trim2untrim.pl Instructions.txt lauralib.pm
+>perl rm_imparalogs <tree file> <alignment file> <distance cutoff> [taxa not to remove> <taxa rank>
+#Will remove sister sequences from the same rank. Will ignore taxa in the list "taxa not to remove".
+</code>
+It will yield the documents "###.removedSeq" and "###.fasttree".
+<code>
+> perl trim2untrim.pl [trimmed alignement] [untrimmed alignment]
+#Will remove sequences from the untrimmed alignement based on sequences present in the trimmed alignement
+</code>
-**3. Step-by-step Protocol**
+Based on the trimmed aligned seq, you can re-analysis more rigorous downstream IQ-tree analysis.
-**4. Limitation**
+Note: not all genes' species have taxa.This have nothing to do with the updates of NCBI taxonomy.
+The '0' in Gene name 'CP_0177652116_0_Stygamoeba_regulata_BSH-02190019' is not a NCBI taxid.
-<Last updated by Xi Zhang on Oct 6th,2021> upcoming
+<Last updated by Xi Zhang on Oct 6th,2021>