Differences

--- phylogeny_protocol2 [2021/10/10 14:50] – 134.190.232.9
+++ phylogeny_protocol2 [2022/02/07 15:21] (current) – 134.190.232.106
@@ Line 1: / Line 1: @@
+The GitHub resource for this protocol: https://github.com/zx0223winner/TreeTuner
 **Background**
@@ Line 318: / Line 320: @@
 * As for NCBI-NR, the translated database header is like this:
+<code>
+>WP_048801694.1	ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
+</code>
+<code>
+# Run the renaming script (link upcoming soon):
+python3 rename_ncbi_blastdb.py
+# The script reads these input files:
+        fastaFile = '/db1/nr-nt-fasta-oct-2020/nr'
+        taxidFile = '/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021/acc2tax_prot_all.txt'
+</code>
+The renamed output then looks like this:
+<code>
+>Leuconostoc_citreum@NCBI_WP_048801694.1_Bacteria_Terrabacteria_group_Firmicutes_Bacilli_Lactobacillales_Lactobacillaceae_Leuconostoc_Leuconostoc_citreum_33964
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQ
+</code>
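+As a rough illustration of the renaming (this is not the rename_ncbi_blastdb.py script itself, which is not linked yet), the sketch below rebuilds the header shown above from an accession, a lineage and a taxid. The function name rename_header and its arguments are hypothetical, and the lineage is assumed to have been looked up already.

```python
def rename_header(accession, lineage, taxid):
    """Build a 'Genus_species@NCBI_<accession>_<lineage>_<taxid>' header.

    Hypothetical helper: `lineage` is a list of rank names ending with the species.
    """
    # Spaces inside rank names become underscores, as in the example output.
    ranks = [name.replace(" ", "_") for name in lineage]
    species = ranks[-1]
    return ">{}@NCBI_{}_{}_{}".format(species, accession, "_".join(ranks), taxid)


lineage = [
    "Bacteria", "Terrabacteria group", "Firmicutes", "Bacilli",
    "Lactobacillales", "Lactobacillaceae", "Leuconostoc", "Leuconostoc citreum",
]
header = rename_header("WP_048801694.1", lineage, 33964)
print(header)
```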
+Directory of the renamed MMETSP database: /misc/scratch2/###/###/mmetsp
+With the two renamed databases available, merge them with 'cat', build the new merged database with 'makeblastdb', and then run BLAST again. 8-)
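+The merge step can be sketched in Python as follows. This is only an illustration under assumptions: the file names are stand-ins written to a temporary directory, the helper names are hypothetical, and the makeblastdb command is only assembled (with the standard -in, -dbtype and -out options), not executed.

```python
import os
import shutil
import tempfile


def merge_fasta(paths, merged_path):
    """Concatenate FASTA files, like `cat a.fasta b.fasta > merged.fasta`."""
    with open(merged_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)


def makeblastdb_command(merged_path, db_name):
    """Return the argument list for building a protein BLAST database."""
    return ["makeblastdb", "-in", merged_path, "-dbtype", "prot", "-out", db_name]


# Tiny stand-in files; the real inputs are the renamed NR and MMETSP FASTA files.
workdir = tempfile.mkdtemp()
nr_path = os.path.join(workdir, "nr_renamed.fasta")
mmetsp_path = os.path.join(workdir, "mmetsp_renamed.fasta")
with open(nr_path, "w") as handle:
    handle.write(">seqA@NCBI_example\nMDNKYT\n")
with open(mmetsp_path, "w") as handle:
    handle.write(">seqB@MMETSP_example\nMKVLAA\n")

merged_path = os.path.join(workdir, "merged.fasta")
merge_fasta([nr_path, mmetsp_path], merged_path)
command = makeblastdb_command(merged_path, os.path.join(workdir, "merged_db"))
# Pass `command` to subprocess.run to actually build the database.
print(" ".join(command))
```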
+**3. Minimizing the redundancy and complexity of large phylogenetic datasets**
+Finally, having covered the two methods above, we can return to the topic raised at the very beginning: coarse- and fine-tuning large phylogenetic datasets by reducing their redundancy and complexity.
+. **Coarse-tuning**: Let's start with the simpler of the two, coarse-tuning with TreeTrimmer (Maruyama et al. 2013).
+<code>
+ruby treetrimmer.rb sample/####_aligned_trimmed.newick sample/###_parameter_input.in sample/taxonomic_info.txt > ###_treetrimmer.newick
+</code>
+The "##..newick" and "###input.in" files are easy to prepare. The taxonomic_info.txt file, however, needs to be reformatted like this:
+<code>
+taxonomic_info.txt
+NP_563657	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis thaliana
+XP_002889406	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis; Arabidopsis lyrata; Arabidopsis lyrata subsp. lyrata
+</code>
+The taxonomic_info.txt file can be created with the acc2tax program. Read more here: http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
+__Note: acc2tax needs the gene ID without its version suffix (e.g. NP_563657); the same applies to NCBI IDs.__ For the usage of the program, see: http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery; http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol3
+<code>
+>WP_048801694.1	ATP-dependent Clp protease ATP-binding subunit [Leuconostoc citreum]GEK62024.1 ATP-dependent Clp protease ATP-binding subunit ClpC [Leuconostoc citreum]
+MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM
+</code>
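+For an accession such as WP_048801694.1 in the header above, the version suffix can be dropped before the acc2tax lookup. A minimal sketch, assuming the version is always a trailing dot followed by digits; the function name strip_version is hypothetical.

```python
def strip_version(accession):
    """Drop a trailing '.<digits>' version from an accession, if present."""
    base, dot, version = accession.rpartition(".")
    if dot and version.isdigit():
        return base
    # No version suffix found: return the accession unchanged.
    return accession


for acc in ["WP_048801694.1", "NP_563657"]:
    print(acc, "->", strip_version(acc))
```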
+With taxonomic_info.txt ready, you can generate the tree file and a second taxa file:
+<code>
+XP_026407875	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; Ranunculales; Papaveraceae; Papaveroideae; Papaver; Papaver somniferum	2	4
+XP_034682772	Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; rosids incertae sedis; Vitales; Vitaceae; Viteae; Vitis; Vitis riparia	2
+</code>
+This gives a rough estimate of the diversity in the tree.
+. **Fine-tuning**: with the scripts written in Perl by Laura Eme (2012-14)
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -o logfile
+#$ -pe threaded 20
+#export PATH=/scratch2/software/anaconda/bin:$PATH
+# $1 is a text file with one FASTA file prefix per line
+while read -r line
+do
+# 1) align with MAFFT, 2) trim sites with BMGE, 3) build a quick tree with FastTree
+mafft --auto --thread 20 "/misc/scratch2/####/$line.fasta" > "/misc/scratch2/####/aligned/$line.aligned.fasta"
+/scratch2/software/anaconda/envs/bmge/bin/bmge -i "/misc/scratch2/####/aligned/$line.aligned.fasta" -t AA -m BLOSUM30 -of "/misc/scratch2/xizhang/####/trimmed/$line.aligned.trimmed.fasta"
+FastTree "/misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta" > "/misc/scratch2/####/fasttree/$line.aligned.trimmed.newick"
+done < "$1"
+</code>
+After the MAFFT, BMGE and FastTree steps, you have the trimmed alignment and the Newick tree. Now use the Perl scripts to prune the leaves or trim the branches.
+<code>
+# These are the files you will need (links upcoming soon):
+# rm_inparal_rank.pl	taxa_rank.txt
+# taxa_not_remove.txt	trim2untrim.pl Instructions.txt lauralib.pm
+>perl rm_inparal_rank.pl <tree file> <alignment file> <distance cutoff> <taxa not to remove> <taxa rank>
+# Removes sister sequences from the same rank. Ignores taxa in the list "taxa not to remove".
+</code>
+This yields the files "###.removedSeq" and "###.fasttree".
+<code>
+> perl trim2untrim.pl [trimmed alignment] [untrimmed alignment]
+# Removes sequences from the untrimmed alignment based on the sequences present in the trimmed alignment
+</code>
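+The filtering idea behind this step can be sketched in Python as below. This is not the trim2untrim.pl script; it is an illustration that assumes sequences are matched by the first word of the FASTA header, and the names read_fasta, seq_id and keep_shared are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered list of (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


def seq_id(header):
    """Use the first whitespace-separated word of the header as the sequence ID."""
    parts = header.split()
    return parts[0] if parts else ""


def keep_shared(trimmed_text, untrimmed_text):
    """Keep only the untrimmed records whose ID also occurs in the trimmed alignment."""
    kept_ids = {seq_id(header) for header, _ in read_fasta(trimmed_text)}
    return [(h, s) for h, s in read_fasta(untrimmed_text) if seq_id(h) in kept_ids]


# Toy alignments: sequence "b" was removed from the trimmed alignment.
trimmed = ">a\nMK-L\n>c\nMR-L\n"
untrimmed = ">a\nMKAL\n>b\nMKGL\n>c\nMRAL\n"
result = keep_shared(trimmed, untrimmed)
for header, seq in result:
    print(">" + header)
    print(seq)
```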
+Based on the trimmed aligned sequences, you can then run a more rigorous downstream analysis with IQ-TREE.
+Note: not every gene's species has taxonomic information. This has nothing to do with updates to the NCBI taxonomy.
+The '0' in the gene name 'CP_0177652116_0_Stygamoeba_regulata_BSH-02190019' is not an NCBI taxid.
-<Last updated by Xi Zhang on Oct 6th,2021> upcoming
+<Last updated by Xi Zhang on Oct 6th,2021>