blast_protocol

Differences

This shows you the differences between two versions of the page.

blast_protocol [2021/09/02 16:11] – created 38.20.199.40blast_protocol [2022/09/06 14:49] (current) 134.190.232.106
Line 1: Line 1:
-**Guide for BLAST usage**+**Guide for BLAST+ (ncbi-blast-2.8.0+ or later) usage**
  
-  - blastp:search protein database(e.g., SwissProt db, NCBI-nr) using protein sequence query +  - __blastp__: search a protein database (e.g., SwissProt db, NCBI-nr) using a protein sequence query 
-  - blastn:search nucleotide database(e.g., NCBI-nt, MMETSP_DB_clean.v2018.fa)using nucleotide sequence query +  - __blastn__: search a nucleotide database (e.g., NCBI-nt, MMETSP_DB_clean.v2018.fa) using a nucleotide sequence query 
-  - blastx:search protein database with translated nucleotide sequence query +  - __blastx__: search protein database with translated nucleotide sequence query 
-  - tblastn:search translated nucleotide database with protein sequence query +  - __tblastn__: search translated nucleotide database with protein sequence query 
-  - tblastx:search translated nucleotide database with translated nucleotide sequence query+  - __tblastx__: search translated nucleotide database with translated nucleotide sequence query
  
-//Note: blastp and blastx can usually provide better hit alignments than blastn, especially for distantly related species.This is because amino acids sequences are more conserved than nucleotides (Koonin and Galperin, 2002).// +{{:blast.png?400|}} 
 + 
 +//**blastp** can usually provide better hit alignments than blastn, especially for distantly related species. This is partly because amino acid sequences are more conserved than nucleotide sequences (Koonin and Galperin, 2002).//  
 + 
 +//**blastx** translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames. It is particularly useful __when the reading frame of the query sequence is unknown or it contains errors that may lead to frame shifts or other coding errors__. Thus blastx is often the first analysis performed with a newly determined nucleotide sequence.// 
 + 
 +//**tblastn** is useful for __finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs)__ and draft genome records. ESTs are short, single-read cDNA sequences. They comprise the largest pool of sequence data for many organisms and contain portions of transcripts from many uncharacterized genes. __Since ESTs have no annotated coding sequences, there are no corresponding protein translations in the BLAST protein databases.__ Hence a tblastn search is the only way to search for these potential coding regions at the protein level.//  
 +Courtesy of the web source: https://guides.lib.berkeley.edu/ncbi/blast 
  
 **General bugs**  **General bugs** 
Line 19: Line 26:
 </code> </code>
  
-Solve :+Solve1 :
 This is due to mistakenly using the blast options.  This is due to mistakenly using the blast options. 
  
Line 34: Line 41:
 Using the BLASTP search option to blast the amino acid sequences against the uniprot_db database. Using the BLASTP search option to blast the amino acid sequences against the uniprot_db database.
 <code> <code>
-./blastp -query XXX.fasta -db uniprot_db -out BLASTP_XXX_uniprot.xml -evalue 1e-5 -outfmt 5+./blastp -query XXX.fasta -db uniprot_db -out BLASTP_XXX_uniprot.xml -evalue 1e-5 -outfmt 5
 </code> </code>
  
Line 77: Line 84:
     25 salltitles    All subject titles, separated by '<>'     25 salltitles    All subject titles, separated by '<>'
  
-    python blastxml_to_tabular.py -o output.tabular -c std input.xml +    python blastxml_to_tabular.py -o output.tabular -c std input.xml 
-    python blastxml_to_tabular.py -o output.tabular -c ext input.xml +    python blastxml_to_tabular.py -o output.tabular -c ext input.xml 
-    python blastxml_to_tabular.py -o output.tabular -c qseqid,qlen,salltitles,sseqid,slen,bitscore,qframe,pident,evalue,qstart,qend,sstart,send,length input.xml+    python blastxml_to_tabular.py -o output.tabular -c qseqid,qlen,salltitles,sseqid,slen,bitscore,qframe,pident,evalue,qstart,qend,sstart,send,length input.xml
 </code> </code>
 +
 +Another way to get tabular BLAST output is to request it directly with -outfmt '6 qseqid sseqid ...' instead of converting XML afterwards.
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +. /etc/profile
 +#$ -pe threaded 2
 +#$ -cwd
 +source activate blast
 +export BLASTDB=/misc/scratch3/rogerlab_databases/other_dbs/nr_010621
 +DB=nr
 +query=ATCG00670.1.fasta
 +blastp -db $DB -query $query -out /scratch2/xizhang/BLASTP_nr.tsv -num_threads 2 -outfmt '6 qseqid sseqid evalue pident qcovs length slen qlen qstart qend sstart send stitle'
 +source deactivate
 +</code>
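
The tabular output from the script above can be reduced to one hit per query with standard tools. The following is a minimal sketch, not part of the original protocol: the function name `best_hits` and the default cutoff of 1e-5 are hypothetical, and it assumes the column order of the -outfmt string above (column 1 = qseqid, column 3 = evalue). It also relies on BLAST listing the hits for each query from best to worst.

```bash
#!/bin/bash
# Keep the first (best-ranked) hit per query from a BLAST tabular file,
# skipping hits whose e-value exceeds a cutoff.
# Assumes column 1 = qseqid and column 3 = evalue, as in the -outfmt string above.
best_hits() {
    # $1: tabular BLAST output; $2: maximum e-value (hypothetical default: 1e-5)
    awk -F'\t' -v max="${2:-1e-5}" '
        ($3 + 0) <= (max + 0) && !seen[$1]++ { print }
    ' "$1"
}

# Example call (input path taken from the script above):
# best_hits /scratch2/xizhang/BLASTP_nr.tsv 1e-10 > BLASTP_nr.best.tsv
```

If the column order in -outfmt changes, the field numbers in the awk program have to change with it.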
 +
 +Sep 6th, 2022: Diamond is faster than BLAST+ for blastp and blastx searches, so the same search can also be run with Diamond. 
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +. /etc/profile
 +#$ -pe threaded 40
 +#$ -cwd
 +source activate /scratch2/software/anaconda/envs/diamond-2.0.7
 +#DB=nr
 +# $1 is a text file listing one query FASTA file per line
 +while read -r line
 +do
 +
 +diamond blastp -p 40 -k 5 -e 1e-10 -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle salltitles --header -d /misc/scratch3/rogerlab_databases/other_dbs/nr_02032022/diamond_nr.dmnd -q "$line" -o "BLASTP_nr.$line.tsv" --sensitive
 +
 +done < "$1"
 +
 +conda deactivate
 +
 +</code>
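
The Diamond script above reads its query files from the list passed as its first argument, so a single mistyped path only shows up once the job is running. As a hedged sketch (the helper name `check_query_list` is hypothetical and not part of the original protocol), the list can be checked before submitting the job:

```bash
#!/bin/bash
# Check that every file named in a query list exists and is not empty.
# Returns 0 if all entries are usable, 1 otherwise.
check_query_list() {
    local list_file=$1
    local missing=0
    local fasta
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r fasta || [ -n "$fasta" ]; do
        [ -z "$fasta" ] && continue          # skip blank lines
        if [ ! -s "$fasta" ]; then
            echo "missing or empty: $fasta" >&2
            missing=$((missing + 1))
        fi
    done < "$list_file"
    [ "$missing" -eq 0 ]
}

# Example call (queries.txt is a hypothetical list file):
# check_query_list queries.txt && echo "all query files found"
```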
 +
 +
 +**V5 NCBI database**
 +
 +The latest blast+ package can be found via https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
 +{{:blast_screenshot.png?nolink&600|}}
 +
 +The V5 NCBI database can be found via https://ftp.ncbi.nlm.nih.gov/blast/db/v5. The advantage of the V5 database over V4 is not only that it is faster, but also that it lets you restrict a search to the taxonomy you are interested in.
 +
 +In order to limit your BLAST+ search by taxonomy, you’ll need to obtain the taxid(s) for your organism(s). Two options can be used here: "taxids" or "taxidlist"
 +
 +To look up the taxid for a group of interest, e.g., bacteria: 
 +<code>
 +./get_species_taxids.sh -n bacteria
 +</code> 
 +The get_species_taxids.sh script ships with the blast+ package, in the bin directory.
 +The taxid for bacteria is 2. Next, retrieve the list of species-level taxids under bacteria: 
 +
 +<code>
 +./get_species_taxids.sh -t 2 > 2.txids
 +</code>
 +
 +Restricting the search to the taxids in 2.txids narrows the search scope of the NCBI V5 database, which is far more efficient than searching the whole database.
 +
 +<code>
 +./blastp -db nr -query QUERY -taxidlist 2.txids -outfmt 5 -out OUTPUT.tab
 +./blastp -db nr -query QUERY -taxids 1117,1118,1119,1121 -outfmt 5 -out OUTPUT.tab
 +</code>
 +
 +With the "taxids" option, separate multiple taxids with commas, e.g., for several cyanobacteria: 1117,1118,1119,1121
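
When a short list of taxids is already stored one per line, the comma-separated value for "taxids" can be built from that file. This is only a sketch: the function name `taxids_to_csv` and the file name in the example are hypothetical. For long lists, the "taxidlist" option shown above is the better fit.

```bash
#!/bin/bash
# Join the numeric lines of a taxid file into one comma-separated string.
# Blank lines and anything that is not a plain integer are ignored.
taxids_to_csv() {
    grep -E '^[0-9]+$' "$1" | paste -sd, -
}

# Example call (cyano.txids is a hypothetical file with one taxid per line):
# ./blastp -db nr -query QUERY -taxids "$(taxids_to_csv cyano.txids)" -outfmt 5 -out OUTPUT.tab
```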
 +
 +
 +Note: Please refer to the guide for the most up-to-date information: https://ftp.ncbi.nlm.nih.gov/blast/db/v5/v5/blastdbv5.pdf
 +
 +{{:22-you-got-this-meme-5.jpg?nolink&200|}}
 +
 +<Last updated by Xi Zhang on Sep 3rd,2021>