Differences

This shows you the differences between two versions of the page.

--- bioinformatics_tools2 [2021/10/08 16:45] – 134.190.232.9
+++ bioinformatics_tools2 [2022/02/28 11:53] (current) – 134.190.232.106
@@ Line 2: / Line 2: @@
-**Approach One: submit array jobs**
+**Approach One: submit a shell script with a loop**
+<code>
+#!/bin/bash
+#script: shell.sh
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -o logfile
+#$ -pe threaded 20
+#export PATH=/scratch2/software/anaconda/bin:$PATH
+while read line
+do
+mafft --auto --thread 20 /misc/scratch2/####/$line.fasta >/misc/scratch2/####/aligned/$line.aligned.fasta
+/scratch2/software/anaconda/envs/bmge/bin/bmge -i /misc/scratch2/####/aligned/$line.aligned.fasta -t AA -m BLOSUM30 -of /misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta
+FastTree /misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta > /misc/scratch2/####/fasttree/$line.aligned.trimmed.newick
+done <$1
+</code>
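+This script needs a list of sequence names, one per line, and it works only when each entry is the bare ID.
+Note: write $line.ko.txt rather than $line_ko.txt in the script. In the latter, the shell reads "line_ko" as the variable name because "_" is a valid character in variable names, so the file is not recognized. Either avoid "_" right before ko.txt or write ${line}_ko.txt. Run the script like this: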
+<code>
+#A plain name list for your FASTA file, e.g.
+Gene1
+Gene2
+Gene3
+#This list can be generated with:
+grep '>' ###.fasta|sed 's/>//g' > name_list.txt
+# If your FASTA headers include gene descriptions (e.g., sequences retrieved directly from NCBI):
+>gen1 hypothetical protein balabalala
+TAGTTAGTCGATCGTACGTA
+# keep only the ID in each header by running:
+awk '{print $1}' seq.fasta > clean_name_id.fasta
+#Then run the shell script.
+chmod +x shell.sh
+./shell.sh name_list.txt
+</code>
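The note about "_" before ko.txt comes down to how bash decides where a variable name ends. Below is a minimal sketch of that behavior; the ID "Gene1" and the variable names wrong, dotted and braced are placeholders chosen for illustration, not part of the original script.

```shell
#!/bin/bash
# Hypothetical sample ID, used only for illustration.
line="Gene1"

# "_" is a valid character in a variable name, so bash looks up a
# variable called "line_ko" (unset here) and expands it to nothing.
wrong="$line_ko.txt"

# "." cannot be part of a variable name, so $line ends before it.
dotted="$line.ko.txt"

# Braces mark exactly where the variable name ends.
braced="${line}_ko.txt"

printf 'wrong=%s\ndotted=%s\nbraced=%s\n' "$wrong" "$dotted" "$braced"
```

With braces, file names such as Gene1_ko.txt can be kept as they are, so renaming files to avoid "_" is optional.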
+Note: the name_list.txt file must end with a line break; otherwise the last line will not be processed.
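The last-line problem can also be handled inside the loop instead of in the list file. The sketch below assumes a temporary three-entry list written without a final line break (the file and the counters plain and guarded are hypothetical) and compares the plain loop with a guarded variant.

```shell
#!/bin/bash
# Hypothetical list file written WITHOUT a final line break.
list=$(mktemp)
printf 'Gene1\nGene2\nGene3' > "$list"

# Plain loop: read returns non-zero on the unterminated last line,
# so the loop body never runs for Gene3.
plain=0
while read -r line; do
    plain=$((plain + 1))
done < "$list"

# Guarded loop: the extra test keeps the body running when read
# failed but still stored a partial line in $line.
guarded=0
while IFS= read -r line || [ -n "$line" ]; do
    guarded=$((guarded + 1))
done < "$list"

echo "plain=$plain guarded=$guarded"
rm -f "$list"
```

The plain loop processes two entries and the guarded loop processes all three. Whether to change shell.sh this way or simply keep the trailing line break in the list is a matter of preference.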
+**Approach Two: submit array jobs**
 Below is a real case: BLASTing thousands of genes against the NCBI nr database. However, it could take weeks to run if we BLAST the whole gene set against the nr database directly.
@@ Line 44: / Line 96: @@
 If you are familiar with ${SGE_TASK_ID}, you will know that the real difficulty is preparing the fasta files so that each one is named with a number, e.g., 1.fa, 2.fa, 3.fa, 4.fa. I have collected some small but efficient scripts to do that.
-  - Method one: using 'csplit' function
+    * Method one: using 'csplit' function
   <code>
@@ Line 52: / Line 104: @@
   # So in this case: -query /misc/scratch2/####/${SGE_TASK_ID}.fa
   #  will be renamed to
-  # -query /misc/scratch2/####/**0**{SGE_TASK_ID}
+  # -query /misc/scratch2/####/0${SGE_TASK_ID}
   # Technically, you can change 0 to whatever you want; it is just a file name prefix.
   </code>
@@ Line 83: / Line 135: @@
 for f in 0*
 do
 python3 index_header_to_seq.py ####.fasta $f $f.fa
 done
@@ Line 91: / Line 141: @@
 </code>
-  - Method two: Run shell script split.sh
+    * Method two: Run shell script split.sh
 <code>
@@ Line 120: / Line 170: @@
-But what if we change the code to this, the CPUs can be then efficiently used.
+<Last updated by Xi Zhang on Oct 8th,2021>
-<Last updated by Xi Zhang on Oct 8th,2021> Upcoming