This part, basically, we are trying to figure out how to acquire the taxa for the genes in MMETSP and NCBI database. First, we display the acc2tax software usage here:
1. Usage of Acc2tax
#acc2tax https://github.com/richardmleggett/acc2tax # Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. # http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery #Input file CDH53707 XP_011133305 KMU79707 XP_002963095 #Outputfile CDH53707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682 XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out
Note: don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above.
E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur:
Couldn't find: [MBI4782295.1]
Trim the version “.1” behind the accession MBI4782295.1
> sed 's/\(.*\)\..*/\1/g' file >out_file
Note: It can still acquire a list of unknown like below even the NCBI taxonomy database is updated to the latest.
Couldn't find: [MBR3349819] Couldn't find: [HBS54143] Couldn't find: [MYJ28876]
This might due to these protein IDs(MBR3349819,HBS54143) from the species cannot put into the taxonomy like NP_051083. i.e., Lineage is not in (full) status.
NP_051083 cellular organisms,Eukaryota,Viridiplantae,Streptophyta,Streptophytina,Embryophyta,Tracheophyta,Euphyllophyta,Spermatophyta,Magnoliopsida,Mesangiospermae,eudicotyledons,Gunneridae,Pentapetalae,rosids,malvids,Brassicales,Brassicaceae,Camelineae,Arabidopsis,Arabidopsis thaliana
# acc2tax database directory: /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
2. MMETSP hierarchical taxonomic info
#New MMETSP database was used containing the taxonomy information. dir: >/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa fasta: >MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO HHYGDSHFBSJBSCJSJKCHSFBSMCNSBCMBSM # VS dir: >/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas fasta: >Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421 SDFHSJFBSNVMSNVMSBHVDBCDMSNCSKFNB
- Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT
> cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10 #CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. #sequence identity threshold, default 0.9 this is the default cd-hit's "global sequence identity" calculated as: number of identical amino acids in alignment divided by the full length of the shorter sequence #nword_length, default 10, see user's guide for choosing it
3. Trivia
# acquire the taxon info.
>grep 'AT5G53350.1' /Users/####/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt
#acquire the length of fasta file.
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/####/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt
# MMETSP DATASET CP_0198131824_1486917_Craspedostauros_australis_CCMP3328 CP_0198725784__Unidentified_sp_CCMP1999 CP_0202964930__Pseudokeronopsis_sp_Brazil CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0) # If for some reason the source organism cannot be mapped to the taxonomy database, the column will contain 0.
- Links
https://github.com/richardmleggett/acc2tax https://ftp.ncbi.nih.gov/pub/taxonomy/ https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ names.dmp, node.dmp # JGI Taxonomy Guide https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/ wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip gi_taxid_nucl.dmp gi_taxid_prot.dmp
- Directories
/misc/scratch2/xizhang/arabidopsis/Taxonomy /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 /scratch3/rogerlab_databases/other_dbs/Acc2Tax_Feb122021 nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt
- What is the difference of the GenPept format and the GenPept (full)?
Full Accession.version taxid 0308206A 8058 0308221A 9606 0308230A 1049 accession accession.version taxid gi A0A009IHW8 A0A009IHW8.1 1310613 1835922267 A0A023FBW4 A0A023FBW4.1 34607 1939884164 A0A023FBW7 A0A023FBW7.1 34607 1939884197
「 :shift + [ at PinYin keyboard
」: shift + ] at PinYin keyboard
# : command +/
<Last updated by Xi Zhang on Oct 9th,2021>
