phylogeny_protocol2
| phylogeny_protocol2 [2021/10/10 13:12] – 134.190.232.9 | phylogeny_protocol2 [2022/02/07 15:21] (current) – 134.190.232.106 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | The GitHub resource for this protocol: https:// | ||
| + | |||
| **Background** | **Background** | ||
| Line 249: | Line 251: | ||
| Then use the makeblastdb command to build the compiled database files. Given the size of NCBI (~100 GB) and MMETSP (~10 GB), I have not tested this myself, but I assume it would take at least three days to run. Here I simply provide the method for you to use. | Then use the makeblastdb command to build the compiled database files. Given the size of NCBI (~100 GB) and MMETSP (~10 GB), I have not tested this myself, but I assume it would take at least three days to run. Here I simply provide the method for you to use. | ||
| - | As for MMETSP, | + | * As for MMETSP, |
| + | < | ||
| + | > | ||
| + | </ | ||
| + | To pull out the taxa information, | ||
| + | |||
| + | < | ||
| + | python3 rename_mmetsp_blastdb.py | ||
| + | </ | ||
| + | |||
| + | Error1: | ||
| + | < | ||
| + | #Note: this error occurs if the script is not run with Python 3 | ||
| + | ImportError: | ||
| + | </ | ||
| + | |||
| + | Error2: | ||
| + | < | ||
| + | from PyQt5 import QtGui, QtCore | ||
| + | RuntimeError: | ||
| + | </ | ||
| + | |||
| + | To solve the errors above, use python3: | ||
| + | < | ||
| + | source activate Unicycler-python3 | ||
| + | pip install six | ||
| + | </ | ||
| + | |||
| + | The first time the script is run on macOS, it might generate an error. (https:// | ||
| + | |||
| + | < | ||
| + | ####@TE809 ~ % / | ||
| + | -- pip install --upgrade certifi | ||
| + | Requirement already satisfied: certifi in / | ||
| + | -- removing any existing file or link | ||
| + | -- creating symlink to certifi certificate bundle | ||
| + | -- setting permissions | ||
| + | -- update complete | ||
| + | Saving session... | ||
| + | ...copying shared history... | ||
| + | ...saving history...truncating history files... | ||
| + | ...completed. | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | #running | ||
| + | / | ||
| + | NCBI database not present yet (first time used?) | ||
| + | Updating taxdump.tar.gz from NCBI FTP site (via HTTP)... | ||
| + | Done. Parsing... | ||
| + | Loading node names... | ||
| + | 2369147 names loaded. | ||
| + | 253927 synonyms loaded. | ||
| + | Loading nodes... | ||
| + | </ | ||
| + | |||
| + | The latest taxdump.tar.gz is then downloaded via the ETE3 package. The output file will look like this: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | XLRCLTRTPSLPSRLLATATPSRACPALSSALHRXASSAAFLRPSASASSCPSRCLSSTSRAPGASGSTQRAIPSXGGANGGWVNPLARPKGESLKKYGTDLNELARAGRLDPVIGRDEEIRRMVQVLSRRRKNNPVLIGEPGVGKTAIVEGLAQRIVDKEVPDSMRDARVIALDVGALVAGAKYRGEFEXRLKAVLADVSEAAGDVILFIDELHTVIGAGAADGAMDASNLLKPQLARGELSCVGATTLX | ||
| + | > | ||
| + | VASRXCEADDXAAAEGTRAVAMLPRLAIYLFAPLASASLVQLPQWPQRRLSPAGRLGLRPLPAAPRGSGQVQMVFDRFDRDAMRLVMDAQVEARKLGGSAVGTEHLLLAGTMQADAIQQALDRAGVKASGVRDAIRGPGGGSIPSLDGLFGLKAKDELLP | ||
| + | </ | ||
| + | |||
| + | * As for NCBI-NR, the translated database header looks like this: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | #So simply run (link coming soon): | ||
| + | python3 rename_ncbi_blastdb.py | ||
| + | #The input files include: | ||
| + | fastaFile = '/ | ||
| + | taxidFile = '/ | ||
| + | </ | ||
| + | |||
| + | Then the desired output will be: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQ | ||
| + | </ | ||
| + | |||
| + | Directory to renamed MMETSP: / | ||
| + | |||
| + | Then, with the two renamed databases available, you can merge them by ' | ||
| + | |||
| + | **3. Minimizing the redundancy and complexity of large phylogenetic datasets** | ||
| + | |||
| + | Finally, after using two different methods, we can return to the topic raised at the very beginning: coarse- and fine-tuning large phylogenetic datasets by reducing their redundancy and complexity. | ||
| + | |||
| + | 1. **Coarse-tuning**: | ||
| + | |||
| + | < | ||
| + | ruby treetrimmer.rb sample/#### | ||
| + | </ | ||
| + | |||
| + | The "## | ||
| + | |||
| + | < | ||
| + | taxonomic_info.txt | ||
| + | NP_563657 Eukaryota; | ||
| + | XP_002889406 Eukaryota; | ||
| + | </ | ||
| + | |||
| + | The taxonomic_info.txt file can be created with the acc2tax program. Please read more here: | ||
| + | |||
| + | __Note: acc2tax needs the gene ID without the version (e.g. NP_563657), | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM | ||
| + | </ | ||
| + | |||
| + | With the taxonomic_info.txt ready, you can get the tree file and another taxa file: | ||
| + | < | ||
| + | XP_026407875 Eukaryota; | ||
| + | XP_034682772 Eukaryota; | ||
| + | </ | ||
| + | |||
| + | This tree gives a rough estimate of tree diversity. | ||
| + | |||
| + | |||
| + | 2. **Fine-tuning** | ||
| + | |||
| + | < | ||
| + | |||
| + | #!/bin/bash | ||
| + | #$ -S /bin/bash | ||
| + | . / | ||
| + | #$ -cwd | ||
| + | #$ -o logfile | ||
| + | #$ -pe threaded 20 | ||
| + | #export PATH=/ | ||
| + | |||
| + | while read line | ||
| + | do | ||
| + | |||
| + | mafft --auto --thread 20 / | ||
| + | |||
| + | / | ||
| + | |||
| + | FastTree / | ||
| + | |||
| + | done <$1 | ||
| + | </ | ||
| + | |||
| + | Let's say after the mafft step, | ||
| + | |||
| + | < | ||
| + | # These are files you will need. (links upcoming soon) | ||
| + | # rm_inparal_rank.pl taxa_rank.txt | ||
| + | # taxa_not_remove.txt trim2untrim.pl Instructions.txt lauralib.pm | ||
| + | |||
| + | >perl rm_imparalogs <tree file> < | ||
| + | #Will remove sister sequences from the same rank. Will ignore taxa in the list "taxa not to remove" | ||
| + | </ | ||
| + | |||
| + | It will yield the documents "### | ||
| + | |||
| + | < | ||
| + | > perl trim2untrim.pl [trimmed alignment] [untrimmed alignment] | ||
| + | #Will remove sequences from the untrimmed alignment based on sequences present in the trimmed alignment | ||
| + | </ | ||
| + | Based on the trimmed aligned sequences, you can rerun a more rigorous downstream IQ-TREE analysis. | ||
| + | Note: not all genes' species have taxa. This has nothing to do with updates to the NCBI taxonomy. | ||
| + | The ' | ||
| - | <Last updated by Xi Zhang on Oct 6th, | + | <Last updated by Xi Zhang on Oct 6th, |
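
The note in the coarse-tuning step says that acc2tax needs gene IDs without the version. A minimal sketch of one way to prepare such IDs is given here. It is not part of the protocol's scripts: the helper names `strip_accession_version` and `unversioned_ids` are hypothetical, and the `.1` suffixes in the example are an assumption for illustration (only the unversioned IDs NP_563657 and XP_002889406 appear in the text).

```python
import re

# Assumption: a version is a trailing ".<digits>" on the accession.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_accession_version(accession: str) -> str:
    """Return the accession without a trailing version suffix, if any."""
    return _VERSION_SUFFIX.sub("", accession.strip())


def unversioned_ids(lines):
    """Yield one unversioned accession per non-empty input line.

    Only the first whitespace-separated token of each line is used,
    so lines that carry extra description text are tolerated.
    """
    for line in lines:
        if line.strip():
            yield strip_accession_version(line.split()[0])


if __name__ == "__main__":
    # Example IDs with hypothetical version suffixes.
    for acc in unversioned_ids(["NP_563657.1", "XP_002889406.1"]):
        print(acc)
```

IDs that already lack a version pass through unchanged, so the helper can be applied to a mixed list before it is handed to acc2tax.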
phylogeny_protocol2.1633882340.txt.gz · Last modified: by 134.190.232.9
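
The trim2untrim.pl step in the fine-tuning section keeps the untrimmed alignment in sync with the trimmed one. The same idea can be sketched in Python as follows. This is not the author's Perl script. It assumes that "based on sequences present in the trimmed alignment" means keeping only those untrimmed records whose ID (the first word of the FASTA header) also occurs in the trimmed alignment. All function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def seq_id(header):
    """Use the first whitespace-separated word of the header as the ID."""
    parts = header.split()
    return parts[0] if parts else ""


def filter_untrimmed(trimmed_text, untrimmed_text):
    """Keep untrimmed records whose ID is present in the trimmed alignment."""
    keep = {seq_id(h) for h, _ in parse_fasta(trimmed_text)}
    return [(h, s) for h, s in parse_fasta(untrimmed_text) if seq_id(h) in keep]


def to_fasta(records):
    """Serialise (header, sequence) tuples back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)
```

In use, the contents of the two alignment files would be read into strings, passed to `filter_untrimmed`, and the result written back out with `to_fasta` before rerunning the alignment and tree steps.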
