bioinformatics_tools2
Jobs are usually submitted to Perun via qsub -q 768G-batch script.sh or qsub -q 256-batch script.sh. But what if you have thousands of scripts waiting to run? Submitting thousands of shell scripts manually would be terrible. This page introduces two approaches to submitting batches of scripts/tasks on Perun; which one to use is a matter of preference.

Approach One: submit array jobs

Below is a real case: BLASTing thousands of genes against the NCBI nr database. Run naively, BLASTing the whole gene set against nr in a single job could take weeks.

<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -o logfile
#$ -pe threaded 10
. /etc/profile

source activate blast
export BLASTDB=/db1/blastdb-sep1-2021/
DB=nr

blastp -db $DB -query /misc/scratch2/XXXX/XXX.fa -out /misc/scratch2/XXXX/XXX.tsv -num_threads 10 -outfmt "6 qseqid sseqid evalue pident qcovs length slen qlen qstart qend sstart send stitle salltitles" -evalue 1e-10 -max_target_seqs 1000

conda deactivate
</code>

But if we split the query dataset into subsets and BLAST each subset separately, Perun's CPUs are used far more efficiently and the total run time drops. For example, suppose we have 2000 genes to BLAST against NCBI nr. The script below runs an array job, i.e., many tasks simultaneously, using ${SGE_TASK_ID}. Here is a real case to demonstrate the usage.

<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -t 1-2000
#$ -o logfile
#$ -pe threaded 1
. /etc/profile

source activate blast
export BLASTDB=/db1/blastdb-sep1-2021/
DB=nr

blastp -db $DB -query /misc/scratch2/####/${SGE_TASK_ID}.fa -out /misc/scratch2/####/${SGE_TASK_ID}.tsv -num_threads 1 -outfmt "6 qseqid sseqid evalue pident qcovs length slen qlen qstart qend sstart send stitle salltitles" -evalue 1e-10 -max_target_seqs 100000

conda deactivate
</code>
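Before submitting thousands of tasks, it can help to dry-run how ${SGE_TASK_ID} will expand. A minimal sketch, with a placeholder project directory (not a real Perun path) and the blastp command echoed rather than executed:

```shell
#!/bin/bash
# SGE sets SGE_TASK_ID automatically for each array task (here 1..2000).
# Fake a single task locally to inspect the command it would run.
export SGE_TASK_ID=3   # pretend we are task number 3

# Placeholder directory; substitute your real /misc/scratch2/ path.
QUERY=/misc/scratch2/project/${SGE_TASK_ID}.fa
OUT=/misc/scratch2/project/${SGE_TASK_ID}.tsv

# Echo instead of running blastp, as a dry run:
echo "blastp -db nr -query $QUERY -out $OUT -num_threads 1"
```

Running this prints the exact command line that task 3 would execute, which makes path mistakes obvious before you queue 2000 jobs.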

If you are familiar with ${SGE_TASK_ID}, you will know the real difficulty is preparing each FASTA file with a number as its name, e.g. 1.fa, 2.fa, 3.fa, 4.fa. Below are some small but efficient scripts to do that.

  1. Method one: using the 'csplit' command
<code>
csplit -ks -n 4 -f 0 ###.fasta '/>/' {2000}

# This splits the FASTA file at each '>' header into files named
# 00000, 00001, 00002, ... (prefix '0' from -f, four digits from -n 4).
# -k keeps the pieces even if the repeat count {2000} overshoots, and
# -s suppresses the byte-count output. Tip: '{*}' repeats the pattern
# until the input is exhausted, so you need not count sequences first.
# Note the names are zero-padded and have no .fa extension, so they do
# not match ${SGE_TASK_ID} directly: either rename the pieces (e.g. to
# 1.fa, 2.fa, ...) or adjust the -query path in the array job.
# The -f prefix can be changed to any string you like.
</code>
The specific usage and option meanings of csplit can be found here: https://man7.org/linux/man-pages/man1/csplit.1.html
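Because csplit's names are zero-padded (00000, 00001, ...) while ${SGE_TASK_ID} counts 1, 2, 3, ..., the pieces usually need renaming. A minimal sketch on dummy files in a scratch directory (the file names mimic csplit -f 0 -n 4 output; adjust the glob to your own prefix):

```shell
#!/bin/bash
# Rename zero-padded csplit output (00000, 00001, ...) to 1.fa, 2.fa, ...
# so the names line up with ${SGE_TASK_ID} in the array job.
mkdir -p rename_demo && cd rename_demo
touch 00000 00001 00002   # stand-ins for real csplit pieces

n=1
for f in 0*; do           # lexicographic glob order matches csplit order
    mv "$f" "${n}.fa"
    n=$((n + 1))
done
ls                        # now: 1.fa 2.fa 3.fa
```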
  1. Method two: run the shell script split.sh
<code>
#!/bin/bash
# Usage: ./split.sh input.fasta
# Splits a multi-FASTA file into one file per sequence, named after
# the header line without the leading '>', e.g. '>geneA' -> geneA.fa

while IFS= read -r line
do
    if [[ ${line:0:1} == '>' ]]; then
        # A new header starts a new file; '>' truncates any old copy
        outfile=${line#>}.fa
        echo "$line" > "$outfile"
    else
        # Sequence lines are appended ('>>') to the current file
        echo "$line" >> "$outfile"
    fi
done < "$1"
</code>

To test it on your data, simply run:

<code>
./split.sh ##2000.fasta
# This also splits the large FASTA file into individual pieces.
</code>
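If you would rather get numbered pieces directly, so the names already match ${SGE_TASK_ID} with no renaming step, a short awk one-liner can do the split. A minimal sketch on a made-up two-sequence FASTA:

```shell
#!/bin/bash
# Split a multi-FASTA into 1.fa, 2.fa, ... with one sequence per file.
cat > sample.fasta <<'EOF'
>geneA
ATGCATGC
>geneB
GGGCCCAA
EOF

# Each '>' header increments the counter n; every line is then
# written to the current n.fa file.
awk '/^>/ { n++ } { print > (n ".fa") }' sample.fasta

head -n 1 2.fa   # prints ">geneB"
```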

With the query split into pieces like this, the array job from Approach One can then use the CPUs efficiently.

<Last updated by Xi Zhang on Oct 8th, 2021> Upcoming
