Quick taxonomy recovery using the accession numbers from a Blast or Plast output:
Use the acc2tax program, which is available in the environment path.
Here is an example of how to run it for protein IDs (-p):
acc2tax -i /db1/extra-data-sets/Acc2tax/acc2taxIN_example -p -d /db1/extra-data-sets/Acc2tax/Acc2Tax_071119 -o taxonomy.out
Make sure that your input file contains only the accession numbers, without their version suffix (see the example file given above).
E.g., MBI4782295.1 must be given as MBI4782295; otherwise the accession is not found and the following error occurs:
Couldn't find: [MBI4782295.1]
To trim the version suffix (e.g., “.1”) from accessions such as MBI4782295.1:
> cat file | cut -d '.' -f1 > out_file
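The trimming step can also be sketched as two small shell helpers. This is only an illustration: strip_version and find_versioned are hypothetical names, not part of acc2tax, and the sketch assumes one accession per line with the version written as a trailing ".<digits>" suffix.

```shell
# Hypothetical helper (not part of acc2tax): remove only a trailing
# ".<digits>" version suffix from each accession read on stdin.
strip_version() {
    sed -E 's/\.[0-9]+$//'
}

# Hypothetical helper: print (with line numbers) any lines of the file "$1"
# that still end in a version suffix. No output means none were found.
find_versioned() {
    grep -nE '\.[0-9]+$' "$1"
}
```

For example, "strip_version < file > out_file" writes the trimmed list, and "find_versioned out_file" can be used to confirm that no versioned accessions remain before running acc2tax.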
Notes:
1. You can get the accession list directly from the Blast/Plast result (output.txt, tab-separated with the accession in column 2) using the command below:
> cat output.txt | cut -f2 | cut -d '.' -f1 > out_file
2. If the accession numbers are wrapped in “|” characters (e.g., gb|KAA8922376.1|), use:
> cat output.txt | cut -f2 | cut -d "|" -f2 | cut -d '.' -f1 > out_file
3. Even when the NCBI taxonomy database is updated to the latest version, acc2tax may still report a list of unknown accessions like the one below:
Couldn't find: [MBR3349819]
Couldn't find: [HBS54143]
Couldn't find: [MYJ28876]
This may be because the species these protein IDs (e.g., MBR3349819, HBS54143) come from cannot be placed in the taxonomy, i.e., their lineage is not in (full) status, unlike NP_051083:
NP_051083 cellular organisms,Eukaryota,Viridiplantae,Streptophyta,Streptophytina,Embryophyta,Tracheophyta,Euphyllophyta,Spermatophyta,Magnoliopsida,Mesangiospermae,eudicotyledons,Gunneridae,Pentapetalae,rosids,malvids,Brassicales,Brassicaceae,Camelineae,Arabidopsis,Arabidopsis thaliana
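The notes above can be sketched as two shell helpers. Both names (extract_accessions, missing_accessions) are hypothetical and not part of acc2tax. The sketch assumes that the Blast/Plast result is tab-separated with the subject ID in column 2, and that the "Couldn't find" messages have been saved to a text file; how acc2tax emits those messages is not stated here, so capturing them is left to the reader. The de-duplication step is an addition for convenience, not something the notes require.

```shell
# Hypothetical helper: build a clean accession list from a tabular
# Blast/Plast result given as "$1".
extract_accessions() {
    cut -f2 "$1" |
        # For IDs such as gb|KAA8922376.1| keep the second "|" field;
        # plain IDs are passed through unchanged.
        awk -F'|' '{ print (NF > 1 ? $2 : $1) }' |
        # Drop a trailing ".<digits>" version suffix.
        sed -E 's/\.[0-9]+$//' |
        # Optional addition: remove duplicate hits (this also sorts the list).
        sort -u
}

# Hypothetical helper: list the accessions named in "Couldn't find: [...]"
# messages stored in the file "$1", one accession per line.
missing_accessions() {
    grep -o "Couldn't find: \[[^]]*]" "$1" |
        sed -E 's/.*\[([^]]*)].*/\1/'
}
```

For example, "extract_accessions output.txt > out_file" prepares the acc2tax input, and "missing_accessions messages.txt" (messages.txt being a placeholder file name) gives the IDs to check manually against NCBI.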
acc2tax database location:
/db1/extra-data-sets/Acc2tax/
/db1/extra-data-sets/Acc2tax/Acc2Tax_04_01_2024 (up to date as of Jan 04, 2024)
<Last updated by Dandan Zhao on Jun 11, 2024>
