Differences

This shows you the differences between two versions of the page.

--- bioinformatics_tools2 [2021/10/08 15:39] – 134.190.232.9
+++ bioinformatics_tools2 [2022/02/28 11:53] (current) – 134.190.232.106
@@ Line 2: / Line 2: @@
-**Approach One: submit array jobs**
+**Approach One: submit for loop shell script**
+<code>
+#script: shell.sh
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -o logfile
+#$ -pe threaded 20
+#export PATH=/scratch2/software/anaconda/bin:$PATH
+while read line
+do
+mafft --auto --thread 20 /misc/scratch2/####/$line.fasta >/misc/scratch2/####/aligned/$line.aligned.fasta
+/scratch2/software/anaconda/envs/bmge/bin/bmge -i /misc/scratch2/####/aligned/$line.aligned.fasta -t AA -m BLOSUM30 -of /misc/scratch2/xizhang/####/trimmed/$line.aligned.trimmed.fasta
+FastTree /misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta > /misc/scratch2/####/fasttree/$line.aligned.trimmed.newick
+done <$1
+</code>
+This script need you have a list of sequence name and sensitive with only ID. Run the script like this:
+Note: $line.ko.txt VS $line_ko.txt, the later one cannot be recognized due to "_" before ko.txt, so I suggest avoid "_" before ko.txt.
+<code>
+#pure name_list file of your fasta, e.g.
+Gene1
+Gene2
+Gene3
+#This can be easily acquired via :
+grep '>' ###.fasta|sed 's/>//g' > name_list.txt
+# If your FASTA seq includes gene descriptions e.g., directly retrieved from NCBI
+> gen1 hypothetical protein balabalala
+TAGTTAGTCGATCGTACGTA
+Simply run: awk '{print $1}' seq.fasta > clean_name_id.fasta
+#Then run the shell script.
+chmod +x shell.sh
+./shell.sh name_list.txt
+</code>
+# must leave one line break for the list.txt file, otherwise the last line will not be proceeded.
+**Approach Two: submit array jobs**
 Below is a real case to BLAST thousands of genes against NCBI-nr database. However, it could take weeks running if we BLAST whole gene against the nr database directly.
@@ Line 37: / Line 89: @@
 DB=nr
-blastp -db $DB -query /misc/scratch2/####/${SGE_TASK_ID}.fa -out /misc/scratch2/####/${SGE_TASK_ID}.ICE-L.tsv -num_threads 1 -outfmt "6 qseqid sseqid evalue pident qcovs length slen qlen qstart qend sstart send stitle salltitles" -evalue 1e-10 -max_target_seqs 100000
+blastp -db $DB -query /misc/scratch2/####/${SGE_TASK_ID}.fa -out /misc/scratch2/####/${SGE_TASK_ID}.tsv -num_threads 1 -outfmt "6 qseqid sseqid evalue pident qcovs length slen qlen qstart qend sstart send stitle salltitles" -evalue 1e-10 -max_target_seqs 100000
 conda deactivate
 </code>
+If you are familiar with ${SGE_TASK_ID}, you will know the real difficult is how to prepare each fasta file with the number as the name, e.g.1.fa, 2.fa, 3.fa, 4.fa. I collect some small but efficient scripts to realize that.
+    * Method one: using 'csplit' function
+  <code>
+  csplit -ks -n 4 -f 0 ###.fasta '/\>/' {2000}
+  # This will split the FASTA file into 2000 single fasta file with the file name like this 01,02,03..
+  # So in this case: -query /misc/scratch2/####/${SGE_TASK_ID}.fa
+  #  will be renamed to
+  # -query /misc/scratch2/####/0{SGE_TASK_ID}
+  # Technically, you can change 0 to whatever you want it is just a file name prefix.
+  </code>
+The specific usage and the meaning of csplit function can be found here:
+ https://man7.org/linux/man-pages/man1/csplit.1.html. People might ask what if I don't want to split them into one for each, how about 10 genes together in file. Absolutely! Try these three steps:
+<code>
+# Step ONE, you need a pure name list file of your fasta, e.g.
+Gene1
+Gene2
+Gene3
+#This can be easily acquired via :
+grep '>' ###.fasta|sed 's/>//g' > name_list.txt
+#Step TWO use csplit to cut the name_list.txt into whatever-the-number-you-want pieces.
+csplit -ks -n 4 -f 0 name_lsit.txt 10 {1999}
+# The number 10 here is to split the name file every ten lines. you can change to e.g., 4 lines, 6 lines..
+# Step Three
+# use for loop to run the python script
+index_header_to_seq.py (https://github.com/zx0223winner/NoBadWordsCombiner/blob/main/Tutorial/index_header_to_seq.py
+#!/bin/bash
+#$ -S /bin/bash
+for f in 0*
+do
+python3 index_header_to_seq.py ####.fasta $f $f.fa
+done
+# it tested OK.
+</code>
+    * Method two: Run shell script split.sh
+<code>
+#!/bin/bash
+# ./split.sh
+#while IFS= read line
+while read line
+do
+if [[ ${line:0:1} == '>' ]] then
+outfile=${line#>}.fa
+echo $line > $outfile else
+echo $line >> $outfile fi
+# be careful this can only be run once, because: '>' is to run once,
+# but '>>' is to continucal give value to the file
+# display $line or do somthing with $line #echo "$line"
+done < $1
+</code>
+How to test your data, simply run
+<code>
+./split.sh ##2000.fasta
+# This also will split the large FASTA file into each pieces.
+</code>
-But what if we change the code to this, the CPUs can be then efficiently used.
-<Last updated by Xi Zhang on Oct 8th,2021> Upcoming
+<Last updated by Xi Zhang on Oct 8th,2021>