Differences

This shows you the differences between two versions of the page.

--- phylogeny_protocol2 [2021/10/10 13:12] – 134.190.232.9
+++ phylogeny_protocol2 [2022/02/07 15:21] (current) – 134.190.232.106
@@ Line 1: / Line 1: @@
+The GitHub resource for this protocol: https://github.com/zx0223winner/TreeTuner
 **Background**
@@ Line 249: / Line 251: @@
 Then use the makeblastdb command to build the compiled database files. Given the size of NCBI (~100 GB) and MMETSP (~10 GB), I have not tested this myself, but I assume it could take at least three days to run. Here I simply provide the method for you to use.
-As for MMETSP,
+* As for MMETSP, the translated database header is like this:
+<code>
+>CP_0113232432_97485_Prymnesium_parvum
+</code>
+To pull out the taxon information, use the new Python script rename_mmetsp_blastdb.py (link coming soon):
+<code>
+python3 rename_mmetsp_blastdb.py
+</code>
+Error1:
+<code>
+#Note: running the script with anything other than Python 3 raises this error
+ImportError: No module named ete3
+</code>
+Error2:
+<code>
+from PyQt5 import QtGui, QtCore
+RuntimeError: the PyQt5.QtCore and PyQt4.QtCore modules both wrap the QObject class
+</code>
+To solve the errors above, use Python 3:
+<code>
+source activate Unicycler-python3
+pip install six
+</code>
+The first time you run the script on macOS, it might fail with an SSL certificate verification error (https://stackoverflow.com/questions/50236117/scraping-ssl-certificate-verify-failed-error-for-http-en-wikipedia-org). To fix this, open Macintosh HD > Applications > Python 3.8 (or the folder of your installed Python 3 version) and double-click "Install Certificates.command":
+<code>
+####@TE809 ~ % /Applications/Python\ 3.9/Install\ Certificates.command ; exit;
+ -- pip install --upgrade certifi
+Requirement already satisfied: certifi in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (2021.10.8)
+ -- removing any existing file or link
+ -- creating symlink to certifi certificate bundle
+ -- setting permissions
+ -- update complete
+Saving session...
+...copying shared history...
+...saving history...truncating history files...
+...completed.
+</code>
+<code>
+#running
+/misc/scratch2/###/arabidopsis/CAM_MMETSP@perun3> python3 rename_mmetsp_blastdb.py
+NCBI database not present yet (first time used?)
+Updating taxdump.tar.gz from NCBI FTP site (via HTTP)...
+Done. Parsing...
+Loading node names...
+2369147 names loaded.
+synonyms loaded.
+Loading nodes...
+</code>
+The latest taxdump.tar.gz is then downloaded by the ETE3 package. The output file looks like this:
+<code>
+>Prymnesium_parvum@CP_0113232432_Eukaryota_Haptista_Haptophyta_Prymnesiophyceae_Prymnesiales_Prymnesiaceae_Prymnesium_Prymnesium_parvum_97485_Prymnesium_parvum
+XLRCLTRTPSLPSRLLATATPSRACPALSSALHRXASSAAFLRPSASASSCPSRCLSSTSRAPGASGSTQRAIPSXGGANGGWVNPLARPKGESLKKYGTDLNELARAGRLDPVIGRDEEIRRMVQVLSRRRKNNPVLIGEPGVGKTAIVEGLAQRIVDKEVPDSMRDARVIALDVGALVAGAKYRGEFEXRLKAVLADVSEAAGDVILFIDELHTVIGAGAADGAMDASNLLKPQLARGELSCVGATTLX
+>Prymnesium_parvum@CP_0113233658_Eukaryota_Haptista_Haptophyta_Prymnesiophyceae_Prymnesiales_Prymnesiaceae_Prymnesium_Prymnesium_parvum_97485_Prymnesium_parvum
+VASRXCEADDXAAAEGTRAVAMLPRLAIYLFAPLASASLVQLPQWPQRRLSPAGRLGLRPLPAAPRGSGQVQMVFDRFDRDAMRLVMDAQVEARKLGGSAVGTEHLLLAGTMQADAIQQALDRAGVKASGVRDAIRGPGGGSIPSLDGLFGLKAKDELLP
+</code>
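Since rename_mmetsp_blastdb.py is not linked yet, here is a minimal Python sketch of the header rewrite it appears to perform, inferred only from the input and output headers shown above. The function names are hypothetical, and the lineage is passed in as a ready-made list of names; in the real script it presumably comes from the ETE3 NCBI taxonomy lookup.

```python
def parse_mmetsp_header(header):
    """Split '>CP_<id>_<taxid>_<species...>' into (seq_id, taxid, species).

    Assumes the layout seen in the example header; other MMETSP headers
    may differ.
    """
    fields = header.lstrip(">").strip().split("_")
    seq_id = "_".join(fields[:2])    # e.g. CP_0113232432
    taxid = int(fields[2])           # e.g. 97485
    species = "_".join(fields[3:])   # e.g. Prymnesium_parvum
    return seq_id, taxid, species


def rename_mmetsp_header(header, lineage_names):
    """Build '>species@id_lineage_taxid_species' from a header and a lineage.

    lineage_names is a list such as ['Eukaryota', ..., 'Prymnesium parvum'];
    how it is obtained (e.g. from ETE3) is outside this sketch.
    """
    seq_id, taxid, species = parse_mmetsp_header(header)
    lineage = "_".join(name.replace(" ", "_") for name in lineage_names)
    return f">{species}@{seq_id}_{lineage}_{taxid}_{species}"
```

Called with the example header and the eight lineage names from Eukaryota down to Prymnesium parvum, this reproduces the first renamed header in the output above.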
+* As for NCBI-NR, the translated database header is like this:
+<code>
+>WP_048801694.1	ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
+</code>
+<code>
+#So simply run (link coming soon):
+python3 rename_ncbi_blastdb.py
+#The input files include:
+        fastaFile = '/db1/nr-nt-fasta-oct-2020/nr'
+        taxidFile = '/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021/acc2tax_prot_all.txt'
+</code>
+Then the desired output will be:
+<code>
+>Leuconostoc_citreum@NCBI_WP_048801694.1_Bacteria_Terrabacteria_group_Firmicutes_Bacilli_Lactobacillales_Lactobacillaceae_Leuconostoc_Leuconostoc_citreum_33964
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQ
+</code>
+Directory of the renamed MMETSP database: /misc/scratch2/###/###/mmetsp
+With the two renamed databases available, you can merge them with 'cat', build the new merged database with 'makeblastdb', and then run BLAST again. 8-)
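The merge-and-rebuild step can be sketched in Python as follows. This is only an illustration under assumptions: the file and database names in the usage note are hypothetical, and the database is assumed to be a protein database because both inputs are translated sequences. The merge adds a newline between files when one is missing, which a plain 'cat' does not do.

```python
import subprocess


def merge_fasta(inputs, merged_path):
    """Concatenate FASTA files, like 'cat', into merged_path.

    A newline is appended after any file that does not end with one,
    so the last sequence of one file cannot fuse with the next header.
    """
    with open(merged_path, "wb") as out:
        for path in inputs:
            last = b"\n"
            with open(path, "rb") as src:
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")


def makeblastdb_command(fasta_path, db_name):
    """Return the makeblastdb call for a protein database as an argument list."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def build_database(fasta_path, db_name):
    """Run makeblastdb; requires BLAST+ to be on the PATH."""
    subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)
```

For example, `merge_fasta(["mmetsp_renamed.fasta", "nr_renamed.fasta"], "merged.fasta")` followed by `build_database("merged.fasta", "merged_db")` would cover the two steps; all four names are placeholders.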
+**3. Minimizing the redundancy and complexity of large phylogenetic datasets**
+Finally, after using two different methods, we can return to the topic raised at the very beginning: coarse- and fine-tuning large phylogenetic datasets by reducing their redundancy and complexity.
+. **Coarse-tuning**: Let's start with the relatively simple one, coarse-tuning via Treetrimmer (Maruyama et al. 2013)
+<code>
+ruby treetrimmer.rb sample/####_aligned_trimmed.newick sample/###_parameter_input.in sample/taxonomic_info.txt > ###_treetrimmer.newick
+</code>
+The "##..newick" and "###input.in" files can easily be prepared. The taxonomic_info.txt file, however, needs to be reformatted:
+<code>
+taxonomic_info.txt
+NP_563657	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis thaliana
+XP_002889406	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis lyrata; Arabidopsis lyrata subsp. lyrata
+</code>
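As a rough illustration of the taxonomic_info.txt layout shown above, the following Python sketch writes and reads one line. It assumes the accession and the lineage are separated by a tab and that lineage ranks are joined by "; ", as the example lines suggest; the function names are hypothetical and not part of Treetrimmer or acc2tax.

```python
def format_taxonomic_info(accession, lineage_names):
    """Return one taxonomic_info.txt line: accession, tab, '; '-joined lineage."""
    return f"{accession}\t{'; '.join(lineage_names)}"


def parse_taxonomic_info(line):
    """Split one taxonomic_info.txt line back into (accession, lineage list)."""
    accession, lineage = line.rstrip("\n").split("\t", 1)
    return accession, [name.strip() for name in lineage.split(";")]
```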
+The taxonomic_info.txt file can be created with the acc2tax program. Please read more here: http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
+__Note: acc2tax needs the gene ID without the version suffix (e.g. NP_563657), and the same applies to NCBI IDs.__ For usage of the program, see: http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery; http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
+<code>
+>WP_048801694.1	ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
+</code>
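Removing the version suffix from an accession such as the one in the header above can be sketched in Python like this. The helper names are hypothetical, and the sketch assumes that the version is always a trailing ".<digits>" and that the accession is the first whitespace-delimited token of the header.

```python
def accession_from_header(header):
    """Return the first token of a FASTA header, without the leading '>'."""
    return header.lstrip(">").split(None, 1)[0]


def strip_version(accession):
    """Drop a trailing '.<digits>' version suffix, if there is one."""
    base, sep, version = accession.rpartition(".")
    if sep and version.isdigit():
        return base
    return accession
```

Applied to the example header, `strip_version(accession_from_header(header))` yields WP_048801694, which is the form acc2tax expects according to the note above.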
+With the taxonomic_info.txt ready, you can get the tree file and another taxa file:
+<code>
+XP_026407875	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; Ranunculales; Papaveraceae; Papaveroideae; Papaver; Papaver somniferum	2	4
+XP_034682772	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; rosids incertae sedis; Vitales; Vitaceae; Viteae; Vitis; Vitis riparia	2
+</code>
+This tree gives a rough estimate of tree diversity.
+. **Fine-tuning**: with the scripts written in Perl by Laura Eme (2012-14)
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -o logfile
+#$ -pe threaded 20
+#export PATH=/scratch2/software/anaconda/bin:$PATH
+while read line
+do
+mafft --auto --thread 20 /misc/scratch2/####/$line.fasta >/misc/scratch2/####/aligned/$line.aligned.fasta
+/scratch2/software/anaconda/envs/bmge/bin/bmge -i /misc/scratch2/####/aligned/$line.aligned.fasta -t AA -m BLOSUM30 -of /misc/scratch2/xizhang/####/trimmed/$line.aligned.trimmed.fasta
+FastTree /misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta > /misc/scratch2/####/fasttree/$line.aligned.trimmed.newick
+done <$1
+</code>
+After the MAFFT, BMGE, and FastTree steps, you have the trimmed alignment and the Newick tree. Now let's use the Perl script to prune the leaves or trim the branches.
+<code>
+# These are files you will need. (links upcoming soon)
+# rm_inparal_rank.pl	taxa_rank.txt
+# taxa_not_remove.txt	trim2untrim.pl Instructions.txt lauralib.pm
+>perl rm_inparal_rank.pl <tree file> <alignment file> <distance cutoff> <taxa not to remove> <taxa rank>
+#Will remove sister sequences from the same rank. Will ignore taxa in the list "taxa not to remove".
+</code>
+It will produce the files "###.removedSeq" and "###.fasttree".
+<code>
+> perl trim2untrim.pl [trimmed alignment] [untrimmed alignment]
+#Will remove sequences from the untrimmed alignment based on the sequences present in the trimmed alignment
+</code>
+Based on the trimmed aligned sequences, you can rerun a more rigorous downstream analysis with IQ-TREE.
+Note: not every gene's species has taxonomic information. This has nothing to do with updates to the NCBI taxonomy.
+The '0' in the gene name 'CP_0177652116_0_Stygamoeba_regulata_BSH-02190019' is not an NCBI taxid.
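A small Python check for such entries might look like the sketch below. It assumes the gene name follows the 'CP_<id>_<taxid>_<species...>' layout and that a taxid field of 0 means no taxid is available; the function name is hypothetical.

```python
def mmetsp_taxid(gene_name):
    """Return the taxid field of an MMETSP gene name, or None if it is 0."""
    fields = gene_name.lstrip(">").split("_")
    taxid = int(fields[2])
    return taxid if taxid > 0 else None
```

Sequences for which this returns None can then be skipped or handled separately before any taxonomy lookup.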
-<Last updated by Xi Zhang on Oct 6th,2021> upcoming
+<Last updated by Xi Zhang on Oct 6th,2021>