phylogeny_protocol3

Differences

This shows you the differences between two versions of the page.

phylogeny_protocol3 [2021/10/09 17:12] 134.190.232.9phylogeny_protocol3 [2021/10/10 23:01] (current) 134.190.232.9
Line 1: Line 1:
-「Ongoing」 Here is an example of how to run it for protein IDs (-p):\\+This section describes how to obtain the taxonomy for genes in the MMETSP and NCBI databases. First, the usage of acc2tax is shown here:
  
 +**1. Usage of Acc2tax**
  
 +<code>
 +#acc2tax https://github.com/richardmleggett/acc2tax 
 +# Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each.
 +# http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery 
 +
 +#Input file
 +CDH53707
 +XP_011133305
 +KMU79707
 +XP_002963095
 +
 +#Output file
 +CDH53707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682
 +XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes
 +KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
 +</code>
 +
 +<code>
    acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out    acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out
 +</code>
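The taxonomy strings in taxonomy.out are long, so a short summary per accession can help when scanning many hits. The following is only a sketch: the helper name `summarize_lineage` is made up for illustration, and it assumes that acc2tax separates the accession from the taxonomy string with a TAB (the separator is not visible in the example above).

```shell
# Sketch: print accession, second rank and last rank of each acc2tax output line.
# Assumption: column 1 (accession) and column 2 (taxonomy string) are TAB-separated.
summarize_lineage() {
  awk -F '\t' '{
    n = split($2, rank, ",")            # ranks are comma-separated
    print $1 "\t" rank[2] "\t" rank[n]  # e.g. Eukaryota ... species or strain
  }' "$@"
}

# Example with one line taken from the output shown above
printf 'XP_011133305\tcellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes\n' | summarize_lineage
```

Called with a file name (e.g. `summarize_lineage taxonomy.out`) it reads that file instead of standard input.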
  
-don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above. +Note: make sure that your input file contains only accession numbers without their version suffix; see the example file given above.
  
-:-( E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur:+E.g., MBI4782295.1 should be given as MBI4782295; otherwise errors will occur:
  
 <code> <code>
Line 13: Line 33:
 </code> </code>
  
-:-D Trim the version ".1" behind the accession MBI4782295.1+Trim the version suffix ".1" from the accession MBI4782295.1
  
 <code> <code>
Line 35: Line 55:
 </code> </code>
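For a whole list of accessions, the version suffix can be removed in one pass before running acc2tax. This is a minimal sketch; the helper name `strip_versions` is hypothetical, and it assumes one accession per line with the version written as a trailing ".<digits>". The acc2tax help text quoted in the earlier revision of this page also lists a `-s | --strip` option for the same purpose.

```shell
# Sketch: remove a trailing ".<digits>" version from each accession (one per line).
strip_versions() {
  sed -E 's/\.[0-9]+$//' "$@"
}

# Accessions without a version pass through unchanged
printf 'MBI4782295.1\nXP_011133305\n' | strip_versions
```

Usage would be along the lines of `strip_versions ids_with_versions.txt > ids.txt`, where both file names are placeholders.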
  
-========================== +<code> 
-acc2tax database:\\ +acc2tax database directory:
- +
 /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021  /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
-\\+</code>
  
  
-MMETSP+**2. MMETSP hierarchical taxonomic info**
  
-7. New MMETSP database was used containing the taxonomy information.+<code> 
 +#A new MMETSP database containing the taxonomy information was used.
 +dir: >/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa 
 +fasta: >MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO  
 +HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM
  
->/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa 
->MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO  
 # VS # VS
->/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas+dir: >/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas 
 +fasta: >Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421 
 +SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB 
 +</code>
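The second header style carries the lineage inside the sequence name, so it can be pulled out without a separate lookup. The sketch below rests on an assumption read off the single example header above: the layout is `>Name@CP_<id>_<rank>_<rank>_..._zzz_...`, with the lineage sitting between the CP identifier and `_zzz_`. The helper name `lineage_from_header` is hypothetical, and rank names that themselves contain underscores would be split incorrectly.

```shell
# Sketch: print "CP_<id><TAB>rank1,rank2,..." for headers of the form
#   >Name@CP_<id>_<rank>_..._<rank>_zzz_...
# Assumption: this layout holds for every header; other lines are skipped.
lineage_from_header() {
  awk '/^>/ && /@/ && /_zzz_/ {
    sub(/^>[^@]*@/, "")       # drop the leading ">Name@"
    sub(/_zzz_.*$/, "")       # drop everything from "_zzz_" onwards
    n = split($0, f, "_")     # f[1]="CP", f[2]=numeric id, f[3..n]=ranks
    lineage = f[3]
    for (i = 4; i <= n; i++) lineage = lineage "," f[i]
    print f[1] "_" f[2] "\t" lineage
  }' "$@"
}

printf '%s\n' '>Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421' | lineage_from_header
```

Sequence lines and headers in the first style (MMETSP0484-...) do not match the pattern and produce no output.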
  
->Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421 +  * Reducing the redundancy of MMETSP and NCBI-nr with CD-HIT
- +
-#The directory of shared folder in Perun:  /scratch4/shared/ +
-# The BLAST -max_target_seqs option does work on yielding all the hits for each Clp in MMETSP and NCBI-Nr.  +
- +
- +
-8. Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT +
- +
-You are right, I am working on reducing the redundancy of MMETSP. Bruce suggested a tool (CD-HIT) which works well for reducing the redundancy of the cases you mentioned in MMETSP. Taking AT5G15450.1 as an example, the MMETSP BLAST search hits were reduced from 502 to ~150. I then took a first look at which species match AT5G15450.1 (E value <= 1e-5) (see below). The tree was created from the NCBI common taxonomy tree (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi) +
  
 +<code>
 > cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10 > cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10
  
-• CD-HIT +#CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.
- +
-CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.+
  
--c sequence identity threshold, default 0.9+#-c: sequence identity threshold, default 0.9
  this is the default cd-hit's "global sequence identity" calculated as:  this is the default cd-hit's "global sequence identity" calculated as:
  number of identical amino acids in alignment  number of identical amino acids in alignment
  divided by the full length of the shorter sequence  divided by the full length of the shorter sequence
- +#-n: word_length, default 10, see user's guide for choosing it
-  -n word_length, default 10, see user's guide for choosing it +</code>
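One thing to double-check about the command above: in the CD-HIT package, cd-hit-est is the nucleotide program and cd-hit is the protein program, and the recommended word length (-n) depends on the identity threshold (-c). For cd-hit on proteins the user's guide gives -n 5 for 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6 and 2 for 0.4 to 0.5. The sketch below only prints a candidate command line and does not run CD-HIT; the helper name `cdhit_word_size` is hypothetical, and it assumes the hits file holds protein sequences.

```shell
# Sketch: choose the cd-hit (protein) word length for a given identity threshold,
# following the ranges listed in the CD-HIT user's guide.
cdhit_word_size() {
  awk -v c="$1" 'BEGIN {
    if      (c >= 0.7) n = 5
    else if (c >= 0.6) n = 4
    else if (c >= 0.5) n = 3
    else if (c >= 0.4) n = 2
    else { print "thresholds below 0.4 are outside the cd-hit range" > "/dev/stderr"; exit 1 }
    print n
  }'
}

# Print (not run) a protein clustering command for the threshold used above
c=0.8
echo "cd-hit -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c $c -n $(cdhit_word_size "$c")"
```

If the hits file really contains nucleotide sequences, cd-hit-est stays the right program, but its word-length ranges differ from the protein ones, so this helper would not apply.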
    
 +**3. Trivia**
  
-9. acc2tax https://github.com/richardmleggett/acc2tax  +<code>
-# Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. +
-# http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery  +
- +
-#Input file +
-CDH53707 +
-XP_011133305 +
-KMU79707 +
-XP_002963095 +
- +
-#Outputfile +
-CDH53707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682 +
-XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes +
-KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703 +
- +
- +
-10. Multi-gene phylogeny tree +
-http://129.173.88.134:81/dokuwiki/doku.php?id=multi-gene_phylogeny_pipeline  +
- +
-Manually: (i.e. taxon has an unexpected placement on the tree) or that generate extra-long branches. +
-Operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. +
 # acquire the taxon info.  # acquire the taxon info. 
->grep 'AT5G53350.1' /Users/zxwinner/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt+>grep 'AT5G53350.1' /Users/####/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt
  
 #acquire the length of fasta file. #acquire the length of fasta file.
-awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/zxwinner/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt +awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/####/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt 
- +</code>
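Note that the length one-liner above ends with `cut -f 1`, which keeps the header column after `paste`; if the sequence length is what col3.txt should hold, column 2 would be the one to keep. A variant that writes the name and length side by side avoids the question. This is a sketch with a hypothetical helper name, `fasta_lengths`, and it assumes a plain FASTA file in which sequences may wrap over several lines.

```shell
# Sketch: print "<sequence name><TAB><sequence length>" for each FASTA record.
fasta_lengths() {
  awk '/^>/ {
         if (name != "") print name "\t" len   # finish the previous record
         name = substr($0, 2); len = 0; next   # start a new record, drop ">"
       }
       { len += length($0) }                   # add up wrapped sequence lines
       END { if (name != "") print name "\t" len }' "$@"
}

printf '>seq1\nMKTAY\nIAK\n>seq2\nMK\n' | fasta_lengths
```

The lengths alone can then be taken with `cut -f 2`.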
-11. Refining the trees +
  
 +<code>
 +# MMETSP DATASET
 CP_0198131824_1486917_Craspedostauros_australis_CCMP3328 CP_0198131824_1486917_Craspedostauros_australis_CCMP3328
  
Line 116: Line 109:
 CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0) CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0)
  
-If for some reason the source organism cannot be mapped to the taxonomy +If for some reason the source organism cannot be mapped to the taxonomy 
   database, the column will contain 0.   database, the column will contain 0.
 +</code>
  
 +  * Links 
  
-NCBI-NR +<code>
- +
-12. Taxon in NCBI-nr (As of Sep 1st.) +
- +
-•  +
-grep 'ATCG00670.1' /Users/zxwinner/Desktop/NCBI-NR/BLASTP_nr_tair10_1.tsv |awk '{print $2}'|sed 's/.*\|\(.*\)\|/\1/g'    +
-grep "AT1G49970.1" /Users/zxwinner/Desktop/NCBI-NR/BLASTP_nr_tair10_1.tsv |awk '{print $2}'|sed 's/.*\|\(.*\)\..*\|/\1/g'+
- +
-• Links +
 https://github.com/richardmleggett/acc2tax  https://github.com/richardmleggett/acc2tax 
 https://ftp.ncbi.nih.gov/pub/taxonomy/ https://ftp.ncbi.nih.gov/pub/taxonomy/
Line 135: Line 122:
  names.dmp, nodes.dmp  names.dmp, nodes.dmp
   
-JGI Taxonomy Guide+JGI Taxonomy Guide
  
 https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/  https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/ 
  wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip   wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip 
  gi_taxid_nucl.dmp   gi_taxid_prot.dmp  gi_taxid_nucl.dmp   gi_taxid_prot.dmp
 +</code>
  
-  +  * Directories
-Directories+
  
 +<code>
 /misc/scratch2/xizhang/arabidopsis/Taxonomy /misc/scratch2/xizhang/arabidopsis/Taxonomy
 /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
Line 149: Line 137:
  
 nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt
- 
 nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt
  
 +</code>
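Once prot_all.txt exists, a handful of accessions can be checked directly against it, which is handy for confirming that an ID is present before blaming acc2tax. The following is a sketch only: `lookup_taxids` is a hypothetical helper, and it assumes the NCBI accession2taxid layout of four TAB-separated columns (accession, accession.version, taxid, gi). The header lines carried over from each .gz file are ignored because they match no query ID.

```shell
# Sketch: print "<accession><TAB><taxid>" for every queried accession found in the table.
# $1: file of accessions without version, one per line
# $2: accession2taxid-style table (accession, accession.version, taxid, gi)
lookup_taxids() {
  awk -F '\t' 'NR == FNR { want[$1] = 1; next }   # first file: remember the queries
               ($1 in want) { print $1 "\t" $3 }' "$1" "$2"
}

# Tiny demonstration with throwaway files
tmp=$(mktemp -d)
printf 'A0A023FBW4\n' > "$tmp/ids.txt"
printf 'A0A023FBW4\tA0A023FBW4.1\t34607\t1939884164\nA0A023FBW7\tA0A023FBW7.1\t34607\t1939884197\n' > "$tmp/prot_all.txt"
lookup_taxids "$tmp/ids.txt" "$tmp/prot_all.txt"
rm -r "$tmp"
```

The query list is held in memory and the table is read once from start to end, so the cost is one linear scan of prot_all.txt per call.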
  
-• acc2tax +  * What is the difference between the GenPept format and GenPept (full)?
- Richard.Leggett@tgac.ac.uk+
  
-Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. +<code>
-Lookup for Genbank IDs is quicker than for accessions, as the lookup table is stored in RAM (though this does mean it takes a couple of minutes to load). For accessions, the lookup is from disc. +
- +
-Provide batch taxonomy information for Genbank IDs or Accessions. +
-Options: +
-    [-h | --help]       This help screen. +
-    [-a | --accession]  Query is accession IDs [default]. +
-    [-c | --column]     1-based column number of ID in input file (default 1). +
-    [-d | --database]   Directory containing NCBI taxonomy files. +
-    [-e | --entries]    Max GI entries (default 1050000000). +
-    [-g | --gi]         Query is Genbank IDs. +
-    [-i | --input]      File of IDs (GI or Accession), one per line. +
-    [-k | --keep]       Copy columns from input to output file, then append taxonomy as new column. +
-    [-n | --nucleotide] Query IDs are nucleotide [default]. +
-    [-o | --output]     Filename of output file. +
-    [-p | --protein]    Query IDs are protein. +
-    [-s | --strip]      Strip version from input acession IDs (ie. everything after .) +
- +
- +
- +
-• What is the difference of the GenPept format and the GenPept (full)?+
 Full Full
 Accession.version taxid Accession.version taxid
Line 187: Line 154:
 A0A023FBW4 A0A023FBW4.1 34607 1939884164 A0A023FBW4 A0A023FBW4.1 34607 1939884164
 A0A023FBW7 A0A023FBW7.1 34607 1939884197 A0A023FBW7 A0A023FBW7.1 34607 1939884197
 +</code>
  
 「 :shift + [ at PinYin keyboard 「 :shift + [ at PinYin keyboard
 +
 」: shift + ] at PinYin keyboard 」: shift + ] at PinYin keyboard
 +
 +# : command +/
  
 <Last updated by Xi Zhang on Oct 9th,2021> <Last updated by Xi Zhang on Oct 9th,2021>
phylogeny_protocol3.1633810369.txt.gz · Last modified: by 134.190.232.9