User Tools

Site Tools


phylogeny_protocol3

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
phylogeny_protocol3 [2021/10/09 17:00] – created 134.190.232.9phylogeny_protocol3 [2021/10/10 23:01] (current) 134.190.232.9
Line 1: Line 1:
 +This part, basically, we are trying to figure out how to acquire the taxa for the genes in MMETSP and NCBI database. First, we display the acc2tax software usage here:
  
-<Last updated by Xi Zhang on Oct 6th,2021>+**1. Usage of Acc2tax** 
 + 
 +<code> 
 +#acc2tax https://github.com/richardmleggett/acc2tax  
 +# Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. 
 +# http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery  
 + 
 +#Input file 
 +CDH53707 
 +XP_011133305 
 +KMU79707 
 +XP_002963095 
 + 
 +#Outputfile 
 +CDH53707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682 
 +XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes 
 +KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703 
 +</code> 
 + 
 +<code> 
 +   acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out 
 +</code> 
 + 
 +Note: don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above.  
 + 
 +E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur: 
 + 
 +<code> 
 +Couldn't find: [MBI4782295.1] 
 + 
 +</code> 
 + 
 +Trim the version ".1" behind the accession MBI4782295.1 
 + 
 +<code> 
 +> sed 's/\(.*\)\..*/\1/g' file >out_file 
 + 
 +</code> 
 + 
 +Note: It can still acquire a list of unknown like below even the NCBI taxonomy database is updated to the latest. 
 + 
 +<code> 
 +Couldn't find: [MBR3349819] 
 +Couldn't find: [HBS54143] 
 +Couldn't find: [MYJ28876] 
 +</code> 
 + 
 +This might due to these protein IDs(MBR3349819,HBS54143) from the species cannot put into the taxonomy like NP_051083. i.e., Lineage is not in (full) status. 
 + 
 + 
 +<code> 
 +NP_051083 cellular organisms,Eukaryota,Viridiplantae,Streptophyta,Streptophytina,Embryophyta,Tracheophyta,Euphyllophyta,Spermatophyta,Magnoliopsida,Mesangiospermae,eudicotyledons,Gunneridae,Pentapetalae,rosids,malvids,Brassicales,Brassicaceae,Camelineae,Arabidopsis,Arabidopsis thaliana 
 +</code> 
 + 
 +<code> 
 +# acc2tax database directory: 
 +/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021  
 +</code> 
 + 
 + 
 +**2. MMETSP hierarchical taxonomic info** 
 + 
 +<code> 
 +#New MMETSP database was used containing the taxonomy information. 
 +dir: >/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa 
 +fasta: >MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO  
 +HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM 
 + 
 +# VS 
 +dir: >/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas 
 +fasta: >Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421 
 +SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB 
 +</code> 
 + 
 +  * Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT 
 + 
 +<code> 
 +> cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10 
 + 
 +#CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. 
 + 
 +#sequence identity threshold, default 0.9 
 + this is the default cd-hit's "global sequence identity" calculated as: 
 + number of identical amino acids in alignment 
 + divided by the full length of the shorter sequence 
 +#nword_length, default 10, see user's guide for choosing it 
 +</code> 
 +  
 +**3. Trivia ** 
 + 
 +<code> 
 +# acquire the taxon info.  
 +>grep 'AT5G53350.1' /Users/####/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt 
 + 
 +#acquire the length of fasta file. 
 +awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/####/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt 
 +</code> 
 + 
 +<code> 
 +# MMETSP DATASET 
 +CP_0198131824_1486917_Craspedostauros_australis_CCMP3328 
 + 
 +CP_0198725784__Unidentified_sp_CCMP1999 
 + 
 +CP_0202964930__Pseudokeronopsis_sp_Brazil 
 + 
 +CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0) 
 + 
 +# If for some reason the source organism cannot be mapped to the taxonomy  
 +  database, the column will contain 0. 
 +</code> 
 + 
 +  * Links  
 + 
 +<code> 
 +https://github.com/richardmleggett/acc2tax  
 +https://ftp.ncbi.nih.gov/pub/taxonomy/ 
 + 
 + https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ 
 + names.dmp, node.dmp 
 +  
 +# JGI Taxonomy Guide 
 + 
 +https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/  
 + wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip  
 + gi_taxid_nucl.dmp   gi_taxid_prot.dmp 
 +</code> 
 + 
 +  * Directories 
 + 
 +<code>  
 +/misc/scratch2/xizhang/arabidopsis/Taxonomy 
 +/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 
 +/scratch3/rogerlab_databases/other_dbs/Acc2Tax_Feb122021 
 + 
 +nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt 
 +nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt 
 + 
 +</code> 
 + 
 +  * What is the difference of the GenPept format and the GenPept (full)? 
 + 
 +<code> 
 +Full 
 +Accession.version taxid 
 +0308206A 8058 
 +0308221A 9606 
 +0308230A 1049 
 + 
 +accession accession.version taxid gi 
 +A0A009IHW8 A0A009IHW8.1 1310613 1835922267 
 +A0A023FBW4 A0A023FBW4.1 34607 1939884164 
 +A0A023FBW7 A0A023FBW7.1 34607 1939884197 
 +</code> 
 + 
 +「 :shift + [ at PinYin keyboard 
 + 
 +」: shift + ] at PinYin keyboard 
 + 
 +# : command +/ 
 + 
 +<Last updated by Xi Zhang on Oct 9th,2021>
phylogeny_protocol3.1633809612.txt.gz · Last modified: by 134.190.232.9