phylogeny_protocol2
| phylogeny_protocol2 [2021/10/10 14:50] – 134.190.232.9 | phylogeny_protocol2 [2022/02/07 15:21] (current) – 134.190.232.106 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | The GitHub resource for this protocol: https:// | ||
| + | |||
| **Background** | **Background** | ||
| Line 318: | Line 320: | ||
| * For NCBI-NR, the translated database header looks like this: | * For NCBI-NR, the translated database header looks like this: | ||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | #So simply run (link to the script coming soon): | ||
| + | python3 rename_ncbi_blastdb.py | ||
| + | #The input files are: | ||
| + | fastaFile = '/ | ||
| + | taxidFile = '/ | ||
| + | </ | ||
| + | |||
| + | Then the desired output will be: | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQ | ||
| + | </ | ||
| + | |||
| + | Directory containing the renamed MMETSP database: / | ||
| + | |||
| + | Then, with the two renamed databases available, you can merge them by ' | ||
| + | |||
| + | **3. Minimizing the redundancy and complexity of large phylogenetic datasets** | ||
| + | |||
| + | Finally, having covered the two methods, we can return to the topic raised at the very beginning: coarse- and fine-tuning large phylogenetic datasets by reducing their redundancy and complexity. | ||
| + | |||
| + | 1. **Coarse-tuning**: | ||
| + | |||
| + | < | ||
| + | ruby treetrimmer.rb sample/#### | ||
| + | </ | ||
| + | |||
| + | The "## | ||
| + | |||
| + | < | ||
| + | taxonomic_info.txt | ||
| + | NP_563657 Eukaryota; | ||
| + | XP_002889406 Eukaryota; | ||
| + | </ | ||
| + | |||
| + | The taxonomic_info.txt file can be created with the acc2tax program. Please read more here: | ||
| + | |||
| + | __Note: acc2tax needs the gene ID without the version (e.g. NP_563657), | ||
| + | |||
| + | < | ||
| + | > | ||
| + | MDNKYTSSAQNVLVLAQEQAKYFKHQAVGTEHLLLALAIEKEGIASKILGQFNVTDDDIREEIEHFTGYGM | ||
| + | </ | ||
| + | |||
| + | With taxonomic_info.txt ready, you can obtain the tree file and another taxa file: | ||
| + | < | ||
| + | XP_026407875 Eukaryota; | ||
| + | XP_034682772 Eukaryota; | ||
| + | </ | ||
| + | |||
| + | This tree gives a rough estimate of the diversity in the tree. | ||
| + | |||
| + | |||
| + | 2. **Fine-tuning** | ||
| + | |||
| + | < | ||
| + | |||
| + | #!/bin/bash | ||
| + | #$ -S /bin/bash | ||
| + | . / | ||
| + | #$ -cwd | ||
| + | #$ -o logfile | ||
| + | #$ -pe threaded 20 | ||
| + | #export PATH=/ | ||
| + | |||
| + | while read -r line | ||
| + | do | ||
| + | |||
| + | mafft --auto --thread 20 / | ||
| + | |||
| + | / | ||
| + | |||
| + | FastTree / | ||
| + | |||
| + | done < "$1" | ||
| + | </ | ||
| + | |||
| + | Let's say after the mafft step, | ||
| + | |||
| + | < | ||
| + | # These are the files you will need (links coming soon): | ||
| + | # rm_inparal_rank.pl taxa_rank.txt | ||
| + | # taxa_not_remove.txt trim2untrim.pl Instructions.txt lauralib.pm | ||
| + | |||
| + | >perl rm_imparalogs <tree file> < | ||
| + | # Removes sister sequences from the same rank. Ignores taxa in the "taxa not to remove" list | ||
| + | </ | ||
| + | |||
| + | It will yield the documents "### | ||
| + | |||
| + | < | ||
| + | > perl trim2untrim.pl [trimmed alignment] [untrimmed alignment] | ||
| + | # Removes sequences from the untrimmed alignment based on the sequences present in the trimmed alignment | ||
| + | </ | ||
| + | |||
| + | Based on the trimmed alignment, you can run a more rigorous downstream IQ-TREE analysis. | ||
| + | |||
| + | Note: not every gene's species has a taxonomy entry. This has nothing to do with updates to the NCBI taxonomy. | ||
| + | The ' | ||
| - | <Last updated by Xi Zhang on Oct 6th, | + | <Last updated by Xi Zhang on Oct 6th, |
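
The note in the coarse-tuning step says acc2tax expects gene IDs without a version (e.g. NP_563657). A minimal Python sketch of that clean-up follows. It assumes each FASTA header starts with an NCBI-style accession such as `NP_563657.1`; the full header format is not shown on this page, so the helper names `strip_version` and `strip_header_version` and the sample headers are hypothetical illustrations, not part of the protocol's scripts.

```python
import re

# Assumption: a version is a trailing ".<digits>" on the accession (e.g. NP_563657.1).
_VERSION = re.compile(r"\.\d+$")


def strip_version(accession: str) -> str:
    """Return the accession without its trailing version, if it has one."""
    return _VERSION.sub("", accession)


def strip_header_version(line: str) -> str:
    """Strip the version from the first token of a FASTA header line.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    first, sep, rest = line[1:].partition(" ")
    return ">" + strip_version(first) + sep + rest
```

Applied line by line to a FASTA file, this leaves the sequences untouched and rewrites only the accession at the start of each header, so the IDs can be looked up by a tool that does not accept versions.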
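
The fine-tuning step describes trim2untrim.pl as removing sequences from the untrimmed alignment based on the sequences present in the trimmed one. The Python sketch below illustrates that idea only; it is not the Perl script itself. It assumes sequences are matched by the first whitespace-delimited token of the FASTA header, and the names `read_fasta`, `seq_id`, and `keep_shared` are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into an ordered list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def seq_id(header):
    """Assumption: the sequence ID is the first token of the header."""
    parts = header.split()
    return parts[0] if parts else ""


def keep_shared(trimmed, untrimmed):
    """Keep only untrimmed records whose ID also occurs in the trimmed set."""
    wanted = {seq_id(h) for h, _ in trimmed}
    return [(h, s) for h, s in untrimmed if seq_id(h) in wanted]
```

For example, `keep_shared(read_fasta(open("trimmed.fasta")), read_fasta(open("untrimmed.fasta")))` would return the untrimmed records that survived trimming, in their original order; both file names are placeholders.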
