This is an old revision of the document!
Background
This is definitely a big topic, I sticked around the topic for nearly a month trying to figure out an easy and straight way for future ICG trainees. Although there are some online tools available, such as Treemmer written in python (2018); TreeTrimmer(2013) written in ruby; Laura Eme (2012-14) written in Perl, the pros and cons of them are obvious. As (Menardo et.al 2018) who developed Treemmer summarized:
- Tree pruner is a tool to manually select and prune leaves/branches from a phylogenetic tree. Tree pruner can be very useful for manual curation, however it is not an automatic method, and it relies on subjective decisions by the users.
- Treetrimmer automatically reduces the number of leaves in a tree to few representatives for each user-defined operational taxonomical unit (OTU), like genus or species. However it is based on user-defined OTU.
- Treemmer, a simple tool based on an iterative algorithm to reduce size and evaluate redundancy of phylogenetic datasets. However, it is not related to hierarchical taxonomic terms, because he thought taxonomic categories are only a very rough proxy for genetic diversity. (The third one is what I added)
- SHOOT.bio a single gene family tree. Shoot only offers a range of curated databases. Because they believe: the motivation for encouraging the use of curated databases over defaulting to the BLAST nr database is that these curated databases should make it easier to picture the evolution of the whole gene family rather than overwhelming the tree with thousands of near-identical hits.
So, I am thinking is there a better way to balance the methods/tools, I am not trying to re-programming and deny what they did. Actually, they did good job on some specific questions. An idea ,suddenly, coming into my mind, which might looks stupid, YES, I decided to name a new pipeline called TreeTuner which aims to combe some of the above methods and realize the coarse and fine-tuning of large phylogenetic datasets via reducing the redundancy and complexity. This might relief my goal on balancing the multiple available methods/tools a little bit.
References:
- Tree pruner: An efficient tool for selecting data from a biased genetic database (Krishnamoorthy et.al 2012)
- TreeTrimmer: A method for phylogenetic dataset size reduction (Maruyama et.al 2013)
- Treemmer: A tool to reduce large phylogenetic datasets with minimal loss of diversity (Menardo et.al 2018)
- SHOOT.bio: sprout a branch on the tree of life (David Emms et.al.2021) Preprint. https://shoot.bio
- TreeTuner: A pipeline to coarse and fine-tuning large phylogenetic datasets via reducing the redundancy and complexity (Zhang et.al 2021)
Before you begin
It was tricky than I thought even at the first step. Do I blast the data before the renaming the header or after? That does matter. If I blast before renaming the header, I need to pull out the BLAST hits sequence and rename them in a hierarchical taxonomic way. If I blast after the process of renaming the header, I need to reming all the database header in a hierarchical taxonomic way. Lost your mind? never mind! I will show you the difference here.
Here, We will deal with two databases: MMETSP and NCBI-nr. Why MMETSP? Because it included the marine microbial eukaryotic transcripts where NCBI-nr don't have all i.e., lots of algae, lots of Haptophytes, Rhodoophytes. It is also annoying for the different formats of headers. For example:
#MMETSP(nucleotide): head /db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa >MMETSP1065-20121228|15685 Amphiprora_paludosa_Strain_CCMP125 CATCGAGTTCATCATCATCGGTGGAATTATCACTGTGATG #MMETSP(translated protein): head /misc/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas >CP_0114609912_95228_Vannella_sp_DIVA3_517_6_12 XEIVKGFKKVADLPDAVFGRFVTATFNIVL
#NCBI-nr #/db1/blastdb-sep1-2021/nr.pal (V5 format, updated Sep 2021)(Blast) # The latest full nr I used: /misc/scratch3/xizhang/nr-fasta-sep-2021/nr.gz head /db1/nr-nt-fasta-oct-2020/nr >PYI97175.1 lysine 2,3-aminomutase [Verrucomicrobia bacterium]PYJ33862.1 lysine 2,3-aminomutase [Verrucomicrobia bacterium] MITPVSEEGNGKRFVSHAPGFWPQTPTELWNDWKWQLKNRVTSLAHLEQHLDLSDEERSGVLLSGDKLALAVTPHFFNLV
I know, it is complicated for each datasets, like the underscore line between the species name (Vannella_sp_DIVA3_517_6_12)is to make sure the BLAST can regard them as a whole for gene name.So that your gene header can contain the taxa information, so that you can pull out the hierarchical taxonomic information. If you look like NCBI-nr, they even don't contain taxa information and the header is even not connected with underscore. That is the purpose of why we have to do the renaming step.
No matter blasting before renaming the header or after, the purpose is the same to acquire the BLAST hits (homolog seqs) from two databases(MMETSP and NCBI). Check the real example below:
>Compsopogon_coeruleus@CP_0184679990_Eukaryota_Rhodophyta_Compsopogonophyceae_Compsopogonales_Compsopogonaceae_Compsopogon_Compsopogon_caeruleus_31354_Compsopogon_coeruleus XNKTVGEKEKVDVGKKGGGGEEREMVGFVSDVFISLNLEWSRVGVGVVNSRGKRKVYAVGEFPGSSPGRTSVLVPQKEKVQKESKEKKRSHGGGKYKVLILNDAFNSMEYVAATLLRLIPGMTTELAWKVMKEAHENGAAVVGVWVFELAEAYCDAIQSAGIGSRIEPE >Tarenaya_hassleriana@NCBI_XP_010551290.1_Eukaryota_Viridiplantae_Streptophyta_Streptophytina_Embryophyta_Tracheophyta_Euphyllophyta_Spermatophyta_Magnoliopsida_Mesangiospermae_eudicotyledons_Gunneridae_Pentapetalae_rosids_malvids_Brassicales_Cleomaceae_New_World_clade_Tarenaya_Tarenaya_hassleriana_28532 MESAICGRLALSPSTVFNSKPGEKHSLYKGPCGNHGFVMSLCASAVGKGGGLLDKPVIEKTTPGRESEFDLRKSRKMAPPYRVILHNDNFNRREYVVQVLMKVIPGMTLDNAVNIMQEAHHNGLAVVIICAQADAEEHCMQLRGNGLLSSIEPASGGGC
1. Method One: blasting before renaming the header
- Step one: BLAST
If you decide to blast before renaming, you can follow the other Perun blast guide (e.g.,http://129.173.88.134:81/dokuwiki/doku.php?id=blast_protocol) to either BLAST the MMETSP database separately or as a whole. Due to the size of the NCBI-nr database (~100 GB), if you merge the NCBI with MMETSP (~10GB), you need to rebuild the Database file such as:
nr.14.phd nr.14.pin nr.14.ppi nr.14.phi nr.14.pog nr.14.psq nr.14.phr nr.14.ppd nr.14.tar.gz.md5
This will be terrible long time doing that, 'cause I have not tested it. I also think it is not worth it unless you want to stick with one version to do intense studies in a period of time. The NCBI-nr version is updated quickly with time. So I prefer to download the latest version of NCBI-Nr, where they have the pre-compiled database files. Use can BLASTp against the database without using makeblastdb command.
Let's say now you have finished the NCBI BLAST and MMETSP blast without renaming the database file at first. You will acquire a bunch of BLAST hits like this (by default, you are using tabular format output):
#MMETSP ATCG00670.1 CP_0184350226_38269_Gloeochaete_wittrockiana 3.32e-76 54.922 98 193 292 196 1 193 98 289 N/A N/A ATCG00670.1 CP_0184350226_38269_Gloeochaete_wittrockiana 3.32e-76 54.922 98 193
#NCBI ATCG00670.1 ref|YP_009356046.1| 8.33e-142 98.980 100 196 196 196 1 196 1 196 ATP-dependent protease subunit [Matthiola incana] ATP-dependent protease subunit [Matthiola incana]<>ATP-dependent protease subunit [Matthiola longipetala]<>ATP-dependent protease subunit [Matthiola incana]<>ATP-dependent protease subunit [Matthiola longipetala] ATCG00670.1 ref|YP_009261721.1| 9.51e-142 98.980 100 196 196 196 1 196 1 196 ATP-dependent Clp protease proteolytic subunit [Pugionium dolabratum] ATP-dependent Clp protease proteolytic subunit [Pugionium dolabratum]<>ATP-dependent Clp protease proteolytic subunit [Pugionium cornutum]<>ATP-dependent Clp protease proteolytic subunit [Pugionium dolabratum]<>ATP-dependent Clp protease proteolytic subunit [Pugionium cornutum]<>ATP-dependent protease subunit [Pugionium pterocarpum] ATCG00670.1 gb|QKK48680.1| 9.72e-142 98.980 100 196 196 196 1 196 1 196 ATP-dependent protease subunit [Robeschia schimperii] ATP-dependent protease subunit [Robeschia schimperii]
Note: It is tricky to blast multiple sequences against database, where you can first split up the fasta file into sub-files to speed up and improve the CPU usage efficiency. Please find more here:http://129.173.88.134:81/dokuwiki/doku.php?id=bioinformatics_tools2
- Step TWO: retrieve the Fasta sequence from the hits
Why need to do this step, because without the hits sequence at hand, the MAFFT,BMGE,iqtree analysis is not available. As for MMETSP, you can use the python3 script: index_header_to_seq.py. I introduced in (http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol)
python3 index_header_to_seq.py /misc/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas name_list.txt output.fa #The name list file includes a list of gene name like this: CP_0184350226_38269_Gloeochaete_wittrockiana
As for the NCBI, which is huge(~100gb), so the python script does not work well on it. Instead, we can use the powerful NCBI built-in tool. seqkit to do this (a cross-platform and ultrafast toolkit for FASTA/Q file manipulation).
#!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 2 export PATH=/opt/perun/bin/:$PATH seqtk subseq /db1/nr-nt-fasta-oct-2020/nr ncbi_id.txt >###_ncbi.fasta
Note: If you only have hundreds of hits in a list, you can instead use the NCBI Batch entrez. Please read more from here: http://129.173.88.134:81/dokuwiki/doku.php?id=phylogeny_protocol
- Step Three: retrieve the taxa information from the protein sequence header
As for the MMETSP gene header which already contain the taxa information. e.g. “CP_0184350226_38269_Gloeochaete_wittrockiana” taxa ID: 38269. But for NCBI ID, it only contains the protein ID like this: ref|YP_009356046.1|. So you have to use the YP_009356046.1 to retrieve the taxa ID for this species.
- Please find the usage of acc2tax here: http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery
- acc2tax :Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. https://github.com/richardmleggett/acc2tax
- /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 (Up to date Sep 20, 2021)
- The taxa information is updated by NCBI weekly via https://ftp.ncbi.nih.gov/pub/taxonomy/; https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
acc2tax -i /db1/extra-data-sets/Acc2tax/name_list.txt -p -d /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 -o taxonomy.out
# name_list.txt XP_011133305 KMU79707 XP_002963095 XP_021355212 WP_058500112 XP_009035520 WP_072908694 WP_057954045
# taxonomy.out XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
Thanks for keeping reading until here, don't forget our goal is to acquire the header like this:
>Compsopogon_coeruleus@CP_0184679990_Eukaryota_Rhodophyta_Compsopogonophyceae_Compsopogonales_Compsopogonaceae_Compsopogon_Compsopogon_caeruleus_31354_Compsopogon_coeruleus XNKTVGEKEKVDVGKKGGGGEEREMVGFVSDVFISLNLEWSRVGVGVVNSRGKRKVYAVGEFPGSSPGRTSVLVPQKEKVQKESKEKKRSHGGGKYKVLILNDAFNSMEYVAATLLRLIPGMTTELAWKVMKEAHENGAAVVGVWVFELAEAYCDAIQSAGIGSRIEPE >Tarenaya_hassleriana@NCBI_XP_010551290.1_Eukaryota_Viridiplantae_Streptophyta_Streptophytina_Embryophyta_Tracheophyta_Euphyllophyta_Spermatophyta_Magnoliopsida_Mesangiospermae_eudicotyledons_Gunneridae_Pentapetalae_rosids_malvids_Brassicales_Cleomaceae_New_World_clade_Tarenaya_Tarenaya_hassleriana_28532 MESAICGRLALSPSTVFNSKPGEKHSLYKGPCGNHGFVMSLCASAVGKGGGLLDKPVIEKTTPGRESEFDLRKSRKMAPPYRVILHNDNFNRREYVVQVLMKVIPGMTLDNAVNIMQEAHHNGLAVVIICAQADAEEHCMQLRGNGLLSSIEPASGGGC
So two python scripts(renaming_MMETSP.py and renaming_NCBI.py) are needed to proceed MMETSP and NCBI-nr,respectively, because they have very different naming format.
2. Method Two: blasting after renaming the header
Still remember we mentioned earlier due to the different naming strategy, we have to decide to blast before renaming the header or after? Now I will introduce the usage of two Python scripts working on the naming issues.
draw_color_tree.py
3. Step-by-step Protocol
4. Limitation
<Last updated by Xi Zhang on Oct 6th,2021> upcoming
