This is an old revision of the document!
Here is an example of how to run it for protein IDs (-p):
acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out
don't forget to make sure that your input file contains only the accession numbers without their version, see the example file given above.
E.g., MBI4782295.1 shall be MBI4782295, otherwise the bugs will occur:
Couldn't find: [MBI4782295.1]
Trim the version “.1” behind the accession MBI4782295.1
> sed 's/\(.*\)\..*/\1/g' file >out_file
Note: It can still acquire a list of unknown like below even the NCBI taxonomy database is updated to the latest.
Couldn't find: [MBR3349819] Couldn't find: [HBS54143] Couldn't find: [MYJ28876]
This might due to these protein IDs(MBR3349819,HBS54143) from the species cannot put into the taxonomy like NP_051083. i.e., Lineage is not in (full) status.
NP_051083 cellular organisms,Eukaryota,Viridiplantae,Streptophyta,Streptophytina,Embryophyta,Tracheophyta,Euphyllophyta,Spermatophyta,Magnoliopsida,Mesangiospermae,eudicotyledons,Gunneridae,Pentapetalae,rosids,malvids,Brassicales,Brassicaceae,Camelineae,Arabidopsis,Arabidopsis thaliana
acc2tax database:
/misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021
MMETSP
7. New MMETSP database was used containing the taxonomy information.
/db1/extra-data-sets/MMETSP/MMETSP_db/MMETSP_DB_clean.v2018.fa
MMETSP0484-20121128|722 Rhodomonas_lens_Strain_RHODO
# VS
/scratch3/sibbald/DATABASES/CAM_P_0001000.pep.renamed_nr_db_temp.fas
Symbiodinium_sp@CP_0181467638_Eukaryota_Alveolata_Dinophyceae_Suessiales_Symbiodiniaceae_Symbiodinium_zzz_CP_0181467638_174948_Symbiodinium_sp_CCMP421
#The directory of shared folder in Perun: /scratch4/shared/ # The BLAST -max_target_seqs option does work on yielding all the hits for each Clp in MMETSP and NCBI-Nr.
8. Reducing the redundancy of MMETSP and NCBI-nr. CD-HIT
You are right, I am working on reducing the redundancy of MMETSP. Bruce suggested a tool (CD-HIT) which works well for reducing the redundancy of the cases you metioned in MMETSP. Taking AT5G15450.1 as example, the MMETSP BLAST search hits was reduced from 502 to ~150. I then gave a first look at what species matching with AT5G15450.1 (E value ⇐10-5)(see below). Tree was created from the NCBI common taxonomy tree (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi)
cd-hit-est -i out_AT5G15450.1_hits.fa -o AT5G15450.1_clp -c 0.8 -n 10
• CD-HIT
CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.
-csequence identity threshold, default 0.9 this is the default cd-hit's “global sequence identity” calculated as: number of identical amino acids in alignment divided by the full length of the shorter sequence
- nword_length, default 10, see user's guide for choosing it
9. acc2tax https://github.com/richardmleggett/acc2tax # Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. # http://129.173.88.134:81/dokuwiki/doku.php?id=taxonomy_recovery
#Input file CDH53707 XP_011133305 KMU79707 XP_002963095
#Outputfile CDH53707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Fungi incertae sedis,Mucoromycota,Mucoromycotina,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia,Lichtheimia corymbifera,Lichtheimia corymbifera JMRC:FSU:9682 XP_011133305 cellular organisms,Eukaryota,Alveolata,Apicomplexa,Conoidasida,Gregarinasina,Eugregarinorida,Gregarinidae,Gregarina,Gregarina niphandrodes KMU79707 cellular organisms,Eukaryota,Opisthokonta,Fungi,Dikarya,Ascomycota,saccharomyceta,Pezizomycotina,leotiomyceta,Eurotiomycetes,Eurotiomycetidae,Onygenales,Onygenales incertae sedis,Coccidioides,Coccidioides immitis,Coccidioides immitis RMSCC 3703
10. Multi-gene phylogeny tree http://129.173.88.134:81/dokuwiki/doku.php?id=multi-gene_phylogeny_pipeline
Manually: (i.e. taxon has an unexpected placement on the tree) or that generate extra-long branches. Operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals.
# acquire the taxon info.
grep 'AT5G53350.1' /Users/zxwinner/Desktop/MMETSP/CAM_MMETSP/BLASTP_CAM_MMETSP_tair10_5.tsv |awk '{print $2}'|sed 's/_/\t/g'|awk '{print $3}'|sort -V|uniq >1.txt
#acquire the length of fasta file. awk '/^>/{if (l!=“”) print l; print; l=0; next}{l+=length($0)}END{print l}' /Users/zxwinner/Desktop/Athaliana_Project_2021/Athaliana_24_aa.fasta |paste - - |cut -f 1 > col3.txt
11. Refining the trees
CP_0198131824_1486917_Craspedostauros_australis_CCMP3328
CP_0198725784Unidentified_sp_CCMP1999 CP_0202964930Pseudokeronopsis_sp_Brazil
CP_0174260800_1078864_Stereomyxa_ramosa_Chinc5 (0)
If for some reason the source organism cannot be mapped to the taxonomy database, the column will contain 0.
NCBI-NR
12. Taxon in NCBI-nr (As of Sep 1st.)
• grep 'ATCG00670.1' /Users/zxwinner/Desktop/NCBI-NR/BLASTP_nr_tair10_1.tsv |awk '{print $2}'|sed 's/.*\|\(.*\)\|/\1/g'| grep “AT1G49970.1” /Users/zxwinner/Desktop/NCBI-NR/BLASTP_nr_tair10_1.tsv |awk '{print $2}'|sed 's/.*\|\(.*\)\..*\|/\1/g'|
• Links https://github.com/richardmleggett/acc2tax https://ftp.ncbi.nih.gov/pub/taxonomy/
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ names.dmp, node.dmp
• JGI Taxonomy Guide
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/ wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip gi_taxid_nucl.dmp gi_taxid_prot.dmp
• Directories
/misc/scratch2/xizhang/arabidopsis/Taxonomy /misc/db1/extra-data-sets/Acc2tax/Acc2tax_092021 /scratch3/rogerlab_databases/other_dbs/Acc2Tax_Feb122021
nohup zcat dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz |sort > nucl_all.txt
nohup zcat dead_prot.accession2taxid.gz prot.accession2taxid.gz | sort > prot_all.txt
• acc2tax Richard.Leggett@tgac.ac.uk
Given a file of accessions or Genbank IDs (one per line), this program will return a taxonomy string for each. Lookup for Genbank IDs is quicker than for accessions, as the lookup table is stored in RAM (though this does mean it takes a couple of minutes to load). For accessions, the lookup is from disc.
Provide batch taxonomy information for Genbank IDs or Accessions. Options:
[-h | --help] This help screen. [-a | --accession] Query is accession IDs [default]. [-c | --column] 1-based column number of ID in input file (default 1). [-d | --database] Directory containing NCBI taxonomy files. [-e | --entries] Max GI entries (default 1050000000). [-g | --gi] Query is Genbank IDs. [-i | --input] File of IDs (GI or Accession), one per line. [-k | --keep] Copy columns from input to output file, then append taxonomy as new column. [-n | --nucleotide] Query IDs are nucleotide [default]. [-o | --output] Filename of output file. [-p | --protein] Query IDs are protein. [-s | --strip] Strip version from input acession IDs (ie. everything after .)
• What is the difference of the GenPept format and the GenPept (full)? Full Accession.version taxid 0308206A 8058 0308221A 9606 0308230A 1049
accession accession.version taxid gi A0A009IHW8 A0A009IHW8.1 1310613 1835922267 A0A023FBW4 A0A023FBW4.1 34607 1939884164 A0A023FBW7 A0A023FBW7.1 34607 1939884197
<Last updated by Xi Zhang on Oct 9th,2021>
