Differences

This shows you the differences between two versions of the page.

--- phylogeny_protocol3 [2021/10/10 15:12] – 134.190.232.9
+++ phylogeny_protocol3 [2021/10/10 23:01] (current) – 134.190.232.9
@@ Line 1: / Line 1: @@
-「Ongoing」 Here is an example of how to run it for protein IDs (-p):\\
+This part, basically, we are trying to figure out how to acquire the taxa for the genes in MMETSP and NCBI database. First, we display the acc2tax software usage here:
+**1. Usage of Acc2tax**
+<code>
+#acc2tax https://github.com/richardmleggett/acc2tax
+# Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each.
+# http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery
+#Input file
+CDH53707
+XP_011133305
+KMU79707
+XP_002963095
+#Outputfile
+CDH53707	cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682
+XP_011133305	cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes
+KMU79707	cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
+</code>
+<code>
    acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out
+</code>
-don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above.
+Note: don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above.
-:-( E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur:
+E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur:
 <code>
@@ Line 13: / Line 33: @@
 </code>
-:-D Trim the version ".1" behind the accession MBI4782295.1
+Trim the version ".1" behind the accession MBI4782295.1
 <code>
@@ Line 35: / Line 55: @@
 </code>
-==========================
+<code>
-acc2tax database:\\
+# acc2tax database directory:
 /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
-\\
+</code>
-MMETSP
+**2. MMETSP hierarchical taxonomic info**
-.	New MMETSP database was used containing the taxonomy information.
+<code>
+#New MMETSP database was used containing the taxonomy information.
+dir: >/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa
+fasta: >MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO
+HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM
->/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa
->MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO
 # VS
->/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas
+dir: >/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas
+fasta: >Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421
+SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB
+</code>
->Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421
+  * Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT
-#The directory of shared folder in Perun:  /scratch4/shared/
-# The BLAST -max_target_seqs option does work on yielding all the hits for each Clp in	 MMETSP and NCBI-Nr.
-.	Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT
-You are right, I am working on reducing the redundancy of MMETSP. Bruce suggested a tool (CD-HIT) which works well for reducing the redundancy of the cases you metioned in MMETSP. Taking AT5G15450.1 as example, the MMETSP BLAST search hits was reduced from 502 to ~150. I then gave a first look at what species matching with AT5G15450.1 (E value <=10-5)(see below). Tree was created from the NCBI common taxonomy tree (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi)
+<code>
 > cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10
-•	CD-HIT
+#CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.
-CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.
+#sequence identity threshold, default 0.9
--csequence identity threshold, default 0.9
  this is the default cd-hit's "global sequence identity" calculated as:
  number of identical amino acids in alignment
  divided by the full length of the shorter sequence
+#nword_length, default 10, see user's guide for choosing it
-  -nword_length, default 10, see user's guide for choosing it
+</code>
+**3. Trivia **
-.	acc2tax https://github.com/richardmleggett/acc2tax
+<code>
-# Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each.
-# http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery
-#Input file
-CDH53707
-XP_011133305
-KMU79707
-XP_002963095
-#Outputfile
-CDH53707	cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682
-XP_011133305	cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes
-KMU79707	cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
-.	Multi-gene phylogeny tree
-http://129.173.88.134:81/dokuwiki/doku.php?id=multi-gene_phylogeny_pipeline
-Manually: (i.e. taxon has an unexpected placement on the tree) or that generate extra-long branches.
-Operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals.
 # acquire the taxon info.
->grep 'AT5G53350.1' /Users/zxwinner/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt
+>grep 'AT5G53350.1' /Users/####/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt
 #acquire the length of fasta file.
-awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/zxwinner/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt
+awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/####/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt
+</code>
-.	Refining the trees
+<code>
+# MMETSP DATASET
 CP_0198131824_1486917_Craspedostauros_australis_CCMP3328
@@ Line 116: / Line 109: @@
 CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0)
-If for some reason the source organism cannot be mapped to the taxonomy
+# If for some reason the source organism cannot be mapped to the taxonomy
  	 database, the column will contain 0.
+</code>
+  * Links
-NCBI-NR
+<code>
-.	 Taxon in NCBI-nr (As of Sep 1st.)
-•
-grep 'ATCG00670.1' /Users/zxwinner/Desktop/NCBI-NR/BLASTP_nr_tair10_1.tsv |awk '{print $2}'|sed 's/.*\|\(.*\)\|/\1/g'|
-grep "AT1G49970.1" /Users/zxwinner/Desktop/NCBI-NR/BLASTP_nr_tair10_1.tsv |awk '{print $2}'|sed 's/.*\|\(.*\)\..*\|/\1/g'|
-•	Links
 https://github.com/richardmleggett/acc2tax
 https://ftp.ncbi.nih.gov/pub/taxonomy/
@@ Line 135: / Line 122: @@
 	names.dmp, node.dmp
-•	JGI Taxonomy Guide
+# JGI Taxonomy Guide
 https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/
 	wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
 	gi_taxid_nucl.dmp   gi_taxid_prot.dmp
+</code>
+  * Directories
-•	Directories
+<code>
 /misc/scratch2/xizhang/arabidopsis/Taxonomy
 /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
@@ Line 149: / Line 137: @@
 nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt
 nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt
+</code>
-•	acc2tax
+  * What is the difference of the GenPept format and the GenPept (full)?
- Richard.Leggett@tgac.ac.uk
-Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each.
+<code>
-Lookup for Genbank IDs is quicker than for accessions, as the lookup table is stored in RAM (though this does mean it takes a couple of minutes to load). For accessions, the lookup is from disc.
-Provide batch taxonomy information for Genbank IDs or Accessions.
-Options:
-    [-h | --help]       This help screen.
-    [-a | --accession]  Query is accession IDs [default].
-    [-c | --column]     1-based column number of ID in input file (default 1).
-    [-d | --database]   Directory containing NCBI taxonomy files.
-    [-e | --entries]    Max GI entries (default 1050000000).
-    [-g | --gi]         Query is Genbank IDs.
-    [-i | --input]      File of IDs (GI or Accession), one per line.
-    [-k | --keep]       Copy columns from input to output file, then append taxonomy as new column.
-    [-n | --nucleotide] Query IDs are nucleotide [default].
-    [-o | --output]     Filename of output file.
-    [-p | --protein]    Query IDs are protein.
-    [-s | --strip]      Strip version from input acession IDs (ie. everything after .)
-•	What is the difference of the GenPept format and the GenPept (full)?
 Full
 Accession.version	taxid
@@ Line 187: / Line 154: @@
 A0A023FBW4	A0A023FBW4.1	34607	1939884164
 A0A023FBW7	A0A023FBW7.1	34607	1939884197
+</code>
 「 ：shift + [ at PinYin keyboard
 」: shift + ] at PinYin keyboard