phylogeny_protocol2
Differences
This shows you the differences between two versions of the page.
| phylogeny_protocol2 [2021/10/10 12:09] – 134.190.232.9 | phylogeny_protocol2 [2022/02/07 15:21] (current) – 134.190.232.106 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | The GitHub resource for this protocol: https:// | ||
| + | |||
| **Background** | **Background** | ||
| Line 124: | Line 126: | ||
| - The directory of the tool: / | - The directory of the tool: / | ||
| - The taxa information is updated by NCBI weekly via https:// | - The taxa information is updated by NCBI weekly via https:// | ||
| - | |||
| - | Note: If you only have hundreds of hits in a list, you can instead use the Taxonomy Common Tree - NCBI. Please read more from here: (http:// | ||
| < | < | ||
| Line 149: | Line 149: | ||
| </ | </ | ||
| - | Thanks for keeping reading until here, don't forget our goal is to acquire the header like this: | + | Note: If your list contains only a few hundred hits, you can use the NCBI Taxonomy Common Tree instead. Read more here: (http:// |
| + | |||
| + | Thanks for reading this far. Don't forget that our goal is to obtain a header like this: | ||
| < | < | ||
| Line 160: | Line 162: | ||
| So two Python scripts (renaming_MMETSP.py and renaming_NCBI.py) are needed to process MMETSP and NCBI-nr, | So two Python scripts (renaming_MMETSP.py and renaming_NCBI.py) are needed to process MMETSP and NCBI-nr, | ||
| + | |||
| + | Now I will introduce the two Python scripts that handle the renaming. First, let me restate what you need for this step. | ||
| + | |||
| + | * Input files for MMETSP: | ||
| + | - A tabular file merging the BLAST hits of all genes: merged_blast_mmetsp.txt | ||
| + | - The FASTA sequences of all the hits: out_mmetsp.fasta | ||
| + | - taxdump.tar.gz: | ||
| + | |||
| + | < | ||
| + | # merged_blast_mmetsp.txt | ||
| + | ATCG00670.1 CP_0184350226_38269_Gloeochaete_wittrockiana 3.32e-76 54.922 98 193 292 196 1 193 98 289 N/ | ||
| + | ATCG00670.1 CP_0184350226_38269_Gloeochaete_wittrockiana 3.32e-76 54.922 98 193 292 196 1 193 98 289 N/ | ||
| + | ATCG00670.1 CP_0184656932_38269_Gloeochaete_witrockiana 4.26e-76 54.922 98 193 | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | # out_mmetsp.fasta | ||
| + | > | ||
| + | XLRCLTRTPSLPSRLLATATPSRACPALSSALHRXASSAAFLRPSASASSCPSRCLSSTSRAPGASGSTQRAIPSXGGANGGWVNPLARPKGESLKKYGTDLNELARAGRLDPVIGRDEEIRRMVQVLSRRRKNNPVLIGEPGVGKTAIVEGLAQRIVDKEVPDSMRDARVIALDVGALVAGAKYRGEFE | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | # | ||
| + | citations.dmp division.dmp gencode.dmp names.dmp readme.txt | ||
| + | delnodes.dmp gc.prt merged.dmp nodes.dmp | ||
| + | </ | ||
| + | |||
| + | Then run the Python script renaming_MMETSP.py (link coming soon). | ||
| + | |||
| + | < | ||
| + | python3 renaming_MMETSP.py | ||
| + | </ | ||
| + | |||
| + | The output FASTA file will look like this: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | XNKTVGEKEKVDVGKKGGGGEEREMVGFVSDVFISLNLEWSRVGVGVVNSRGKRKVYAVGEFPGSSPGRTSVLVPQKEKVQKESKEKKRSHGGGKYKVLILNDAFNSMEYVAATLLRLIPGMTTELAWKVMKEAHENGAAVVGVWVFELAEAYCDAIQSAGIGSRIEPE | ||
| + | </ | ||
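Since renaming_MMETSP.py is not linked yet, here is a minimal sketch of what its first step might look like: pulling the hit IDs out of the merged BLAST table and splitting each ID into its parts. The ID layout (`<prefix>_<sequence id>_<taxid>_<Genus>_<species>`) is only inferred from the sample rows of merged_blast_mmetsp.txt above, and the function names `parse_mmetsp_hit` and `hit_ids_from_blast_table` are hypothetical, not taken from the real script.

```python
import re

# Assumed layout of an MMETSP hit ID (inferred from the sample table, not confirmed):
#   <prefix>_<sequence id>_<NCBI taxid>_<Genus>_<species...>
HIT_ID = re.compile(
    r"^(?P<prefix>[A-Za-z]+)_(?P<seq_id>\d+)_(?P<taxid>\d+)_(?P<name>.+)$"
)


def parse_mmetsp_hit(hit_id):
    """Split a hit ID into its parts; return None if the layout does not match."""
    match = HIT_ID.match(hit_id)
    if match is None:
        return None
    return {
        "prefix": match.group("prefix"),
        "seq_id": match.group("seq_id"),
        "taxid": int(match.group("taxid")),
        "species": match.group("name").replace("_", " "),
    }


def hit_ids_from_blast_table(lines):
    """Collect the unique hit IDs (second column) of a BLAST table, in order."""
    seen = {}
    for line in lines:
        fields = line.split()
        # Skip comment lines and rows that have no second column.
        if line.startswith("#") or len(fields) < 2:
            continue
        seen.setdefault(fields[1], None)
    return list(seen)
```

The taxid extracted this way is what a lineage lookup against taxdump.tar.gz would need; IDs that do not fit the assumed layout come back as `None` so they can be checked by hand.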
| + | |||
| + | |||
| + | * Input files for NCBI: | ||
| + | - A tabular file merging the BLAST hits of all genes: merged_ncbi.txt | ||
| + | - The FASTA sequences of all the hits: new.merged_ncbi.fasta | ||
| + | - taxdump.tar.gz: | ||
| + | - merged_taxon.txt | ||
| + | |||
| + | Because NCBI uses a different header format, merged_taxon.txt is also needed. It was obtained by retrieving the gene names of all the hits from / | ||
| + | |||
| + | < | ||
| + | # merged_taxon.txt | ||
| + | A1BI15 A1BI15.1 290317 166201813 | ||
| + | A1WR17 A1WR17.1 391735 166214720 | ||
| + | A4J7L9 A4J7L9.1 349161 172044337 | ||
| + | A5D447 A5D447.1 370438 259585961 | ||
| + | A6SY74 A6SY74.1 375286 226706516 | ||
| + | A8F754 A8F754.1 416591 167008651 | ||
| + | A8WPG6 A8WPG6.2 6238 300681014 | ||
| + | </ | ||
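To make the role of merged_taxon.txt concrete, here is a small sketch that loads it into a lookup table from accession to taxid. It assumes the four columns are accession, accession.version, taxid, and GI, which matches the sample rows above but is not stated in the protocol; `load_accession_to_taxid` is a hypothetical helper, not part of renaming_NCBI.py.

```python
def load_accession_to_taxid(lines):
    """Map each accession (with and without version) to its integer taxid.

    Assumed column order: accession, accession.version, taxid, gi.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and a possible header row.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        accession, versioned, taxid = fields[0], fields[1], int(fields[2])
        mapping[accession] = taxid
        mapping[versioned] = taxid
    return mapping


if __name__ == "__main__":
    sample = [
        "# merged_taxon.txt",
        "A1BI15 A1BI15.1 290317 166201813",
        "A8WPG6 A8WPG6.2 6238 300681014",
    ]
    table = load_accession_to_taxid(sample)
    print(table["A8WPG6.2"])
```

Storing both the versioned and unversioned accession keeps the lookup working whichever form appears in the BLAST hit header.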
| + | |||
| + | Then run the script (link upcoming): | ||
| + | < | ||
| + | python3 renaming_NCBI.py | ||
| + | </ | ||
| + | |||
| + | The output file will look like this: | ||
| + | < | ||
| + | > | ||
| + | MESAICGRLALSPSTVFNSKPGEKHSLYKGPCGNHGFVMSLCASAVGKGGGLLDKPVIEKTTPGRESEFDLRKSRKMAPPYRVILHNDNFNRREYVVQVLMKVIPGMTLDNAVNIMQEAHHNGLAVVIICAQADAEEHCMQLRGNGLLSSIEPASGGGC | ||
| + | </ | ||
| + | |||
| + | Finally, (~ ̄▽ ̄)~ we have the desired BLAST hit headers from both MMETSP and NCBI-nr, containing the hierarchical taxonomic terms. You can then merge the two files and continue with whichever tools you are most familiar with for sequence alignment, sequence trimming, and tree building. Please read more: http:// | ||
| + | |||
| + | After all of this, you will still need a step to color your Newick tree. You can find the script color_newick_tree.py via http:// | ||
| **2. Method Two: blasting after renaming the header** | **2. Method Two: blasting after renaming the header** | ||
| - | Still remember we mentioned earlier due to the different naming strategy, we have to decide to blast before renaming the header or after? | + | Do you remember that, as mentioned earlier, the different naming strategies mean we have to decide whether to blast before or after renaming the header? |
| - | draw_color_tree.py | + | < |
| + | > | ||
| + | XNKTVGEKEKVDVGKKGGGGEEREMVGFVSDVFISLNLEWSRVGVGVVNSRGKRKVYAVGEFPGSSPGRTSVLVPQKEKVQKESKEKKRSHGGGKYKVLILNDAFNSMEYVAATLLRLIPGMTTELAWKVMKEAHENGAAVVGVWVFELAEAYCDAIQSAGIGSRIEPE | ||
| + | |||
| + | > | ||
| + | MESAICGRLALSPSTVFNSKPGEKHSLYKGPCGNHGFVMSLCASAVGKGGGLLDKPVIEKTTPGRESEFDLRKSRKMAPPYRVILHNDNFNRREYVVQVLMKVIPGMTLDNAVNIMQEAHHNGLAVVIICAQADAEEHCMQLRGNGLLSSIEPASGGGC | ||
| + | </ | ||
| + | |||
| + | Then use the makeblastdb command to build the database files. Given the size of NCBI (~100 GB) and MMETSP (~10 GB), I have not actually tested this myself, but I assume it would take at least three days to run. Here I simply provide the method for you to use freely. | ||
| + | |||
| + | * For MMETSP, the translated database header looks like this: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | </ | ||
| + | |||
| + | To pull out the taxa information, | ||
| + | |||
| + | < | ||
| + | python3 rename_mmetsp_blastdb.py | ||
| + | </ | ||
| + | |||
| + | Error1: | ||
| + | < | ||
| + | # Note: this error appears if the script is not run with Python 3 | ||
| + | ImportError: | ||
| + | </ | ||
| + | |||
| + | Error2: | ||
| + | < | ||
| + | from PyQt5 import QtGui, QtCore | ||
| + | RuntimeError: | ||
| + | </ | ||
| + | |||
| + | To solve the errors above, use Python 3: | ||
| + | < | ||
| + | source activate Unicycler-python3 | ||
| + | pip install six | ||
| + | </ | ||
| + | |||
| + | The first time you run the script on macOS, it might generate an error (https:// | ||
| + | |||
| + | < | ||
| + | ####@TE809 ~ % / | ||
| + | -- pip install --upgrade certifi | ||
| + | Requirement already satisfied: certifi in / | ||
| + | -- removing any existing file or link | ||
| + | -- creating symlink to certifi certificate bundle | ||
| + | -- setting permissions | ||
| + | -- update complete | ||
| + | Saving session... | ||
| + | ...copying shared history... | ||
| + | ...saving history...truncating history files... | ||
| + | ...completed. | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | #running | ||
| + | / | ||
| + | NCBI database not present yet (first time used?) | ||
| + | Updating taxdump.tar.gz from NCBI FTP site (via HTTP)... | ||
| + | Done. Parsing... | ||
| + | Loading node names... | ||
| + | 2369147 names loaded. | ||
| + | 253927 synonyms loaded. | ||
| + | Loading nodes... | ||
| + | </ | ||
| + | |||
| + | The latest taxdump.tar.gz is then downloaded via the ETE3 package. The output file will look like this: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | XLRCLTRTPSLPSRLLATATPSRACPALSSALHRXASSAAFLRPSASASSCPSRCLSSTSRAPGASGSTQRAIPSXGGANGGWVNPLARPKGESLKKYGTDLNELARAGRLDPVIGRDEEIRRMVQVLSRRRKNNPVLIGEPGVGKTAIVEGLAQRIVDKEVPDSMRDARVIALDVGALVAGAKYRGEFEXRLKAVLADVSEAAGDVILFIDELHTVIGAGAADGAMDASNLLKPQLARGELSCVGATTLX | ||
| + | > | ||
| + | VASRXCEADDXAAAEGTRAVAMLPRLAIYLFAPLASASLVQLPQWPQRRLSPAGRLGLRPLPAAPRGSGQVQMVFDRFDRDAMRLVMDAQVEARKLGGSAVGTEHLLLAGTMQADAIQQALDRAGVKASGVRDAIRGPGGGSIPSLDGLFGLKAKDELLP | ||
| + | </ | ||
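The log above shows ETE3 building its local NCBI taxonomy database on first use. As a rough sketch of how a taxid could then be turned into a lineage string for a header, the function below uses the `get_lineage` and `get_taxid_translator` methods of ETE3's `NCBITaxa` class. The `;` separator and the function name `lineage_string` are assumptions for illustration; the exact header layout that rename_mmetsp_blastdb.py writes is not shown on this page.

```python
def lineage_string(ncbi, taxid, separator=";"):
    """Return the named lineage of a taxid, joined root-first by `separator`.

    `ncbi` is expected to behave like ete3.NCBITaxa.
    """
    lineage = ncbi.get_lineage(taxid)            # list of taxids, root first
    names = ncbi.get_taxid_translator(lineage)   # dict: taxid -> scientific name
    return separator.join(names[t] for t in lineage)


# Usage with ETE3 (builds the local taxonomy database on first use):
#   from ete3 import NCBITaxa
#   print(lineage_string(NCBITaxa(), 38269))
```

Passing the taxonomy object in as an argument means the database is opened once and reused for every sequence, rather than once per header.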
| + | |||
| + | * For NCBI-nr, the translated database header looks like this: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | # So simply run | ||
| + | python3 rename_ncbi_blastdb.py  # link coming soon | ||
| + | # The input files include: | ||
| + | fastaFile = '/ | ||
| + | taxidFile = '/ | ||
| + | </ | ||
| + | |||
| + | Then the desired output will be: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQ | ||
| + | </ | ||
| + | |||
| + | Directory to renamed MMETSP: / | ||
| + | |||
| + | With the two renamed databases available, you can then merge them by ' | ||
| + | |||
| + | **3. Minimizing the redundancy and complexity of large phylogenetic datasets** | ||
| + | |||
| + | Finally, having covered the two methods, we can return to the topic raised at the very beginning: coarse- and fine-tuning large phylogenetic datasets by reducing their redundancy and complexity. | ||
| + | |||
| + | 1. **Coarse-tuning**: | ||
| + | |||
| + | < | ||
| + | ruby treetrimmer.rb sample/#### | ||
| + | </ | ||
| + | |||
| + | The "## | ||
| + | |||
| + | < | ||
| + | taxonomic_info.txt | ||
| + | NP_563657 Eukaryota; | ||
| + | XP_002889406 Eukaryota; | ||
| + | </ | ||
| + | |||
| + | The taxonomic_info.txt file can be created with the acc2tax program. Please read more here: | ||
| + | |||
| + | __Note: acc2tax needs the gene ID without the version (e.g. NP_563657), | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM | ||
| + | </ | ||
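Because acc2tax expects IDs without the version, a small preprocessing step may help. The sketch below strips a trailing `.<digits>` suffix from an accession and applies that to the first token of each FASTA header. The names `strip_version` and `strip_versions_in_fasta` are hypothetical, and the sketch assumes the accession is the first whitespace-delimited token of the header line.

```python
def strip_version(accession):
    """Return the accession without a trailing '.<digits>' version suffix."""
    base, dot, version = accession.rpartition(".")
    if dot and version.isdigit():
        return base
    return accession


def strip_versions_in_fasta(lines):
    """Yield FASTA lines with the version removed from the first header token."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id, sep, description = line[1:].partition(" ")
            line = ">" + strip_version(seq_id) + sep + description
        yield line
```

Only a purely numeric suffix is removed, so IDs that happen to contain a dot for other reasons are left untouched.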
| + | |||
| + | With the taxonomic_info.txt ready, you can get the tree file and another taxa file: | ||
| + | < | ||
| + | XP_026407875 Eukaryota; | ||
| + | XP_034682772 Eukaryota; | ||
| + | </ | ||
| + | |||
| + | This tree gives a rough estimate of the tree's diversity. | ||
| + | |||
| + | |||
| + | 2. **Fine-tuning** | ||
| + | |||
| + | < | ||
| + | |||
| + | # | ||
| + | #$ -S /bin/bash | ||
| + | . / | ||
| + | #$ -cwd | ||
| + | #$ -o logfile | ||
| + | #$ -pe threaded 20 | ||
| + | #export PATH=/ | ||
| + | |||
| + | while read line | ||
| + | do | ||
| + | |||
| + | mafft --auto --thread 20 / | ||
| + | |||
| + | / | ||
| + | |||
| + | FastTree / | ||
| + | |||
| + | done <$1 | ||
| + | </ | ||
| + | |||
| + | Let's say after the mafft, | ||
| + | |||
| + | < | ||
| + | # These are files you will need. (links upcoming soon) | ||
| + | # rm_inparal_rank.pl taxa_rank.txt | ||
| + | # taxa_not_remove.txt trim2untrim.pl Instructions.txt lauralib.pm | ||
| + | |||
| + | >perl rm_imparalogs <tree file> < | ||
| + | #Will remove sister sequences from the same rank. Will ignore taxa in the list "taxa not to remove" | ||
| + | </ | ||
| + | |||
| + | It will yield the documents "### | ||
| + | |||
| + | < | ||
| + | > perl trim2untrim.pl [trimmed alignment] [untrimmed alignment] | ||
| + | #Will remove sequences from the untrimmed alignment based on sequences present in the trimmed alignment | ||
| + | </ | ||
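For readers who prefer Python to Perl, the idea behind trim2untrim.pl can be sketched as follows. This is not the original script: it assumes that "based on sequences present in the trimmed alignment" means keeping only those untrimmed sequences whose ID also appears in the trimmed file, and that the ID is the first token of the FASTA header. All function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into an ordered list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def seq_id(header):
    """Return the first whitespace-delimited token of a header ('' if empty)."""
    tokens = header.split()
    return tokens[0] if tokens else ""


def keep_shared_sequences(trimmed_lines, untrimmed_lines):
    """Keep the untrimmed records whose ID also occurs in the trimmed alignment."""
    kept_ids = {seq_id(header) for header, _ in read_fasta(trimmed_lines)}
    return [
        (header, seq)
        for header, seq in read_fasta(untrimmed_lines)
        if seq_id(header) in kept_ids
    ]
```

The result keeps the order of the untrimmed file, so it can be written back out as FASTA for the downstream alignment and tree steps.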
| - | **3. Step-by-step Protocol** | + | Based on the trimmed alignment, you can rerun a more rigorous downstream analysis with IQ-TREE. |
| - | **4. Limitation** | + | Note: not all genes' species have taxa. This has nothing to do with updates to the NCBI taxonomy. |
| + | The ' | ||
| - | <Last updated by Xi Zhang on Oct 6th, | + | <Last updated by Xi Zhang on Oct 6th, |
phylogeny_protocol2.1633878578.txt.gz · Last modified: by 134.190.232.9
