phylogeny_protocol3

Differences

This shows you the differences between two versions of the page.

-phylogeny_protocol3 [2021/10/09 17:01] 134.190.232.9
+phylogeny_protocol3 [2021/10/10 23:01] (current) 134.190.232.9
Line 1: Line 1:
-Here is an example of how to run it for protein IDs (-p):\\
+This section describes how to retrieve taxonomy for genes from the MMETSP and NCBI databases. First, the usage of the acc2tax software:
  
 +**1. Usage of Acc2tax**
  
 +<code>
 +#acc2tax https://github.com/richardmleggett/acc2tax 
 +# Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each.
 +# http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery 
 +
 +#Input file
 +CDH53707
 +XP_011133305
 +KMU79707
 +XP_002963095
 +
 +#Output file
 +CDH53707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682
 +XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes
 +KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
 +</code>
 +
 +<code>
    acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out
 +</code>
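
The taxonomy strings in the output file can be reduced to a few ranks for quick inspection. The following is a minimal sketch, not part of acc2tax: `summarise_lineage` is a hypothetical helper, and it assumes each output line is an accession, one tab or space, and then a comma-separated lineage, as in the example output above.

```shell
# Sketch: print accession, second rank (e.g. Eukaryota) and last rank per line.
# Assumption: "<accession><tab or space><comma-separated lineage>" on each line.
summarise_lineage() {
  awk '{
    acc = $1
    lineage = $0
    sub(/^[^ \t]+[ \t]+/, "", lineage)   # drop the accession and its separator
    n = split(lineage, rank, ",")
    print acc "\t" rank[2] "\t" rank[n]
  }'
}

# Usage with one line of the example output shown earlier
printf '%s\n' \
  'XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes' |
  summarise_lineage
```

In practice the function would read the real output, e.g. `summarise_lineage < taxonomy.out`.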
  
-don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above.
+Note: make sure that your input file contains only accession numbers without their version suffix; see the example file given above.
  
-:-( E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur:
+E.g., MBI4782295.1 must be written as MBI4782295; otherwise errors like the following occur:
  
 <code> <code>
Line 13: Line 33:
 </code> </code>
  
-:-D Trim the version ".1" behind the accession MBI4782295.1
+Trim the version suffix ".1" from the accession MBI4782295.1
  
 <code> <code>
Line 35: Line 55:
 </code> </code>
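
One way to remove version suffixes from a whole accession list is sketched here. `strip_versions` is a hypothetical helper name, and the sketch assumes one accession per line with the version written as a trailing `.N`.

```shell
# Sketch: remove trailing whitespace and a trailing ".<digits>" version suffix.
# Accessions without a version are passed through unchanged.
strip_versions() {
  sed -E 's/[[:space:]]+$//; s/\.[0-9]+$//'
}

# Usage: one versioned and one unversioned accession
printf '%s\n' MBI4782295.1 CDH53707 | strip_versions
```

The result can be redirected to a new file and passed to `acc2tax -i`.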
  
-==========================
+<code>
-acc2tax database:\\
+# acc2tax database directory:
 +/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021  
 +</code> 
 + 
 + 
 +**2. MMETSP hierarchical taxonomic info** 
 + 
 +<code> 
 +# A new MMETSP database containing the taxonomy information was used. 
 +dir: >/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa 
 +fasta: >MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO  
 +HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM 
 + 
 +# VS 
 +dir: >/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas 
 +fasta: >Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421 
 +SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB 
 +</code> 
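
The second header style carries the lineage inside the sequence name, so it can be pulled out with a regular expression. This is a minimal sketch: `header_to_lineage` is a hypothetical helper, and it assumes the lineage always sits between `@CP_<digits>_` and `_zzz_` and that no taxon name contains an underscore.

```shell
# Sketch: extract the underscore-separated lineage from a renamed MMETSP header
# and print it comma-separated. Headers that do not match are skipped.
header_to_lineage() {
  sed -nE 's/^>?[^@]*@CP_[0-9]+_(.*)_zzz_.*$/\1/p' | tr '_' ','
}

# Usage with the header shown above
printf '%s\n' \
  '>Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421' |
  header_to_lineage
```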
 + 
  * Reducing the redundancy of MMETSP and NCBI-nr with CD-HIT 
 + 
 +<code> 
 +> cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10 
 +# Note: cd-hit-est is the nucleotide version of CD-HIT. If the hits are protein sequences, use cd-hit instead (the user's guide recommends -n 5 for -c between 0.7 and 1.0). 
 + 
 +# CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. 
 + 
 +# -c: sequence identity threshold, default 0.9 
 +#     this is cd-hit's default "global sequence identity", calculated as: 
 +#     number of identical amino acids in alignment 
 +#     divided by the full length of the shorter sequence 
 +# -n: word_length, default 10 for cd-hit-est (5 for cd-hit); see the user's guide for choosing it 
 +</code> 
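
CD-HIT writes a `.clstr` file next to its output, in which the representative of each cluster is the member line ending in `*`. The sketch below lists those representatives. `cluster_representatives` is a hypothetical helper, and the sequence names in the usage example are made up for illustration.

```shell
# Sketch: print the representative sequence name of every cluster in a
# CD-HIT .clstr file (the member line that ends with "*").
cluster_representatives() {
  awk '/\*$/ {
    sub(/^[^>]*>/, "")     # drop everything up to and including ">"
    sub(/\.\.\. \*$/, "")  # drop the trailing "... *"
    print
  }'
}

# Usage on a small made-up .clstr excerpt
printf '%b\n' \
  '>Cluster 0' \
  '0\t300aa, >seqA... *' \
  '1\t280aa, >seqB... at 85.00%' \
  '>Cluster 1' \
  '0\t150aa, >seqC... *' |
  cluster_representatives
```

With the command above, the real input would be `AT5G15450.1_clp.clstr`.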
 +  
 +**3. Trivia** 
 + 
 +<code> 
 +# Extract the taxon field from the BLASTP hits of one query. 
 +>grep 'AT5G53350.1' /Users/####/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt 
 + 
 +# Get the sequence lengths of a FASTA file. paste - - yields "header<TAB>length" per sequence; cut -f 1 keeps the headers, cut -f 2 would keep the lengths. 
 +awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/####/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt 
 +</code> 
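
The length calculation can also be written so that the ID and the length come out together on one line, which avoids the `paste`/`cut` step. This is a sketch of an alternative, not the command used above; `fasta_lengths` is a hypothetical helper, and it takes the ID to be the first whitespace-delimited word of each header.

```shell
# Sketch: print "<sequence id><TAB><sequence length>" for every FASTA record,
# summing the lengths of wrapped sequence lines.
fasta_lengths() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len
      name = substr($1, 2)   # first word of the header, without ">"
      len = 0
      next
    }
    { len += length($0) }
    END { if (name != "") print name "\t" len }
  '
}

# Usage on a two-record toy input
printf '%s\n' '>a first record' 'MKV' 'LL' '>b' 'M' | fasta_lengths
```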
 + 
 +<code> 
 +# MMETSP DATASET 
 +CP_0198131824_1486917_Craspedostauros_australis_CCMP3328 
 + 
 +CP_0198725784__Unidentified_sp_CCMP1999 
 + 
 +CP_0202964930__Pseudokeronopsis_sp_Brazil 
 + 
 +CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0) 
 + 
 +# If for some reason the source organism cannot be mapped to the taxonomy  
 +   database, the column will contain 0. 
 +</code> 
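
The IDs above can be split into the CP identifier and a taxonomy ID, with 0 filled in where the field is empty. This sketch assumes that the third underscore-separated field is the NCBI taxonomy ID, which the examples suggest but the source does not state. `mmetsp_taxid` is a hypothetical helper.

```shell
# Sketch: print "<CP id><TAB><taxid>", using 0 when the taxid field is empty.
# Assumption: fields are CP_<number>_<taxid>_<organism...>.
mmetsp_taxid() {
  awk -F'_' 'NF { printf "%s_%s\t%s\n", $1, $2, ($3 == "" ? 0 : $3) }'
}

# Usage with two of the IDs listed above
printf '%s\n' \
  'CP_0198131824_1486917_Craspedostauros_australis_CCMP3328' \
  'CP_0198725784__Unidentified_sp_CCMP1999' |
  mmetsp_taxid
```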
 + 
 +  * Links  
 + 
 +<code> 
 +https://github.com/richardmleggett/acc2tax  
 +https://ftp.ncbi.nih.gov/pub/taxonomy/ 
 + 
 + https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ 
 + names.dmp, nodes.dmp 
 +  
 +# JGI Taxonomy Guide 
 + 
 +https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/  
 + wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip  
 + gi_taxid_nucl.dmp   gi_taxid_prot.dmp 
 +</code> 
 + 
 +  * Directories 
 + 
 +<code>  
 +/misc/scratch2/xizhang/arabidopsis/Taxonomy 
 +/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 
 +/scratch3/rogerlab_databases/other_dbs/Acc2Tax_Feb122021 
 + 
 +nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt 
 +nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt 
 + 
 +</code> 
 + 
  * What is the difference between the GenPept format and GenPept (full)? 
 + 
 +<code> 
 +Full 
 +Accession.version taxid 
 +0308206A 8058 
 +0308221A 9606 
 +0308230A 1049 
 + 
 +accession accession.version taxid gi 
 +A0A009IHW8 A0A009IHW8.1 1310613 1835922267 
 +A0A023FBW4 A0A023FBW4.1 34607 1939884164 
 +A0A023FBW7 A0A023FBW7.1 34607 1939884197 
 +</code> 
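
Because the two layouts put the taxid in different columns, a lookup has to check the column count first. A minimal sketch under that assumption: `taxid_of` is a hypothetical helper that reads either layout on standard input, treats rows as whitespace-separated, and takes an accession without version.

```shell
# Sketch: print the taxid for one accession from either table layout.
#   4 columns: accession, accession.version, taxid, gi  -> taxid is column 3
#   2 columns: accession.version, taxid                 -> taxid is column 2
taxid_of() {
  awk -v acc="$1" '
    NF == 4 && $1 == acc { print $3; exit }
    NF == 2 {
      base = $1
      sub(/\.[0-9]+$/, "", base)   # ignore a version suffix if present
      if (base == acc) { print $2; exit }
    }
  '
}

# Usage with rows from the two examples above
printf '%s\n' 'Accession.version taxid' '0308206A 8058' '0308221A 9606' |
  taxid_of 0308221A
printf '%s\n' 'accession accession.version taxid gi' \
  'A0A009IHW8 A0A009IHW8.1 1310613 1835922267' |
  taxid_of A0A009IHW8
```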
 + 
+「 : Shift + [ on the Pinyin keyboard
  
-/scratch3/rogerlab_databases/other_dbs/Acc2Tax_Feb122021 (Up to date Feb 23, 2021)
+」 : Shift + ] on the Pinyin keyboard
  
-/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 (Up to date Sep 20, 2021)
+# : Command + /
-\\
  
 <Last updated by Xi Zhang on Oct 9th,2021> <Last updated by Xi Zhang on Oct 9th,2021>