This part, basically, we are trying to figure out how to acquire the taxa for the genes in MMETSP and NCBI database. First, we display the acc2tax software usage here:
**1. Usage of Acc2tax**
#acc2tax https://github.com/richardmleggett/acc2tax
# Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each.
# http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery
#Input file
CDH53707
XP_011133305
KMU79707
XP_002963095
#Outputfile
CDH53707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682
XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes
KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out
Note: don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above.
E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur:
Couldn't find: [MBI4782295.1]
Trim the version ".1" behind the accession MBI4782295.1
> sed 's/\(.*\)\..*/\1/g' file >out_file
Note: It can still acquire a list of unknown like below even the NCBI taxonomy database is updated to the latest.
Couldn't find: [MBR3349819]
Couldn't find: [HBS54143]
Couldn't find: [MYJ28876]
This might due to these protein IDs(MBR3349819,HBS54143) from the species cannot put into the taxonomy like NP_051083. i.e., Lineage is not in (full) status.
NP_051083 cellular organisms,Eukaryota,Viridiplantae,Streptophyta,Streptophytina,Embryophyta,Tracheophyta,Euphyllophyta,Spermatophyta,Magnoliopsida,Mesangiospermae,eudicotyledons,Gunneridae,Pentapetalae,rosids,malvids,Brassicales,Brassicaceae,Camelineae,Arabidopsis,Arabidopsis thaliana
# acc2tax database directory:
/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
**2. MMETSP hierarchical taxonomic info**
#New MMETSP database was used containing the taxonomy information.
dir: >/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa
fasta: >MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO
HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM
# VS
dir: >/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas
fasta: >Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421
SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB
* Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT
> cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10
#CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.
#sequence identity threshold, default 0.9
this is the default cd-hit's "global sequence identity" calculated as:
number of identical amino acids in alignment
divided by the full length of the shorter sequence
#nword_length, default 10, see user's guide for choosing it
**3. Trivia **
# acquire the taxon info.
>grep 'AT5G53350.1' /Users/####/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt
#acquire the length of fasta file.
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/####/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt
# MMETSP DATASET
CP_0198131824_1486917_Craspedostauros_australis_CCMP3328
CP_0198725784__Unidentified_sp_CCMP1999
CP_0202964930__Pseudokeronopsis_sp_Brazil
CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0)
# If for some reason the source organism cannot be mapped to the taxonomy
database, the column will contain 0.
* Links
https://github.com/richardmleggett/acc2tax
https://ftp.ncbi.nih.gov/pub/taxonomy/
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
names.dmp, node.dmp
# JGI Taxonomy Guide
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
gi_taxid_nucl.dmp gi_taxid_prot.dmp
* Directories
/misc/scratch2/xizhang/arabidopsis/Taxonomy
/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
/scratch3/rogerlab_databases/other_dbs/Acc2Tax_Feb122021
nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt
nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt
* What is the difference of the GenPept format and the GenPept (full)?
Full
Accession.version taxid
0308206A 8058
0308221A 9606
0308230A 1049
accession accession.version taxid gi
A0A009IHW8 A0A009IHW8.1 1310613 1835922267
A0A023FBW4 A0A023FBW4.1 34607 1939884164
A0A023FBW7 A0A023FBW7.1 34607 1939884197
「 :shift + [ at PinYin keyboard
」: shift + ] at PinYin keyboard
# : command +/