bioinformatics_tools2

Differences

This shows you the differences between two versions of the page.

bioinformatics_tools2 [2021/10/08 16:45] 134.190.232.9
bioinformatics_tools2 [2022/02/28 11:53] (current) 134.190.232.106
Line 2: Line 2:
    
  
-**Approach One: submit array jobs**
+**Approach One: submit for loop shell script** 
 + 
 +<code> 
 +#!/bin/bash 
 +#script: shell.sh 
 +#$ -S /bin/bash 
 +. /etc/profile 
 +#$ -cwd 
 +#$ -o logfile 
 +#$ -pe threaded 20 
 +#export PATH=/scratch2/software/anaconda/bin:$PATH 
 + 
 +while read line 
 +do 
 + 
 +mafft --auto --thread 20 /misc/scratch2/####/$line.fasta >/misc/scratch2/####/aligned/$line.aligned.fasta 
 + 
 +/scratch2/software/anaconda/envs/bmge/bin/bmge -i /misc/scratch2/####/aligned/$line.aligned.fasta -t AA -m BLOSUM30 -of /misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta 
 + 
 +FastTree /misc/scratch2/####/trimmed/$line.aligned.trimmed.fasta > /misc/scratch2/####/fasttree/$line.aligned.trimmed.newick 
 + 
 +done <$1 
 +</code> 
 + 
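 +This script requires a list of sequence names that contains only the sequence IDs. 
 + 
 +Note: use $line.ko.txt rather than $line_ko.txt. In the latter, the shell reads "line_ko" as the variable name because "_" is a valid character in variable names, so the ID is not substituted. Avoid "_" directly before ko.txt, or write ${line}_ko.txt. 
 + 
 +Run the script like this: 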
 + 
 +<code> 
 +#pure name_list file of your fasta, e.g. 
 +Gene1 
 +Gene2 
 +Gene3 
 + 
 +#This can be easily acquired via : 
 +grep '>' ###.fasta|sed 's/>//g' > name_list.txt 
 + 
 +# If your FASTA headers include gene descriptions (e.g. sequences retrieved directly from NCBI): 
 +# >gen1 hypothetical protein balabalala 
 +# TAGTTAGTCGATCGTACGTA 
 + 
 +# keep only the first field (the ID) of each line: 
 +awk '{print $1}' seq.fasta > clean_name_id.fasta 
 + 
 +#Then run the shell script. 
 +chmod +x shell.sh 
 +./shell.sh name_list.txt 
 +</code>  
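The underscore problem with `$line_ko.txt` comes from how the shell parses variable names: `_` is a valid name character, so the shell looks up a variable called `line_ko`. A minimal sketch, using a hypothetical ID `Gene1` (not taken from the original data), shows the three cases side by side:

```shell
#!/bin/bash
# Hypothetical ID, for illustration only.
line="Gene1"

# Dot form: "." cannot be part of a variable name, so $line expands as expected.
dot_name="$line.ko.txt"

# Underscore form without braces: the shell expands the unset variable "line_ko".
bad_name="$line_ko.txt"

# Braces mark where the variable name ends, so the underscore form works.
good_name="${line}_ko.txt"

printf '%s\n' "$dot_name" "$bad_name" "$good_name"
```

If underscore-separated file names are preferred, writing `${line}` with braces throughout the loop avoids the issue.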
 + 
 +Note: the name list file must end with a line break; otherwise the last line will not be processed. 
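The last line is skipped because `read` returns a non-zero status when it reaches end of file without a newline, even though it has already filled the variable. As a sketch only (the function name `read_list` and the temporary file are assumptions for illustration, not part of shell.sh), the loop condition can be extended so the final line is still handled:

```shell
#!/bin/bash
# Print every non-empty line of a list file, including a final line
# that is not terminated by a newline.
read_list() {
  # "|| [ -n "$line" ]" keeps the loop alive for an unterminated last line.
  while IFS= read -r line || [ -n "$line" ]; do
    if [ -n "$line" ]; then
      printf '%s\n' "$line"
    fi
  done < "$1"
}

# Demo: a list file whose last line has no trailing newline.
tmp_list=$(mktemp)
printf 'Gene1\nGene2\nGene3' > "$tmp_list"
ids=$(read_list "$tmp_list")
rm -f "$tmp_list"
printf '%s\n' "$ids"
```

The same condition could be used in the `while read line` loop of the pipeline script, so that the list file works with or without the trailing line break.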
 + 
 + 
 +**Approach Two: submit array jobs**
  
 Below is a real case: BLASTing thousands of genes against the NCBI nr database. However, it could take weeks to run if we BLAST the whole gene set against the nr database directly.
Line 44: Line 96:
 If you are familiar with ${SGE_TASK_ID}, you will know that the real difficulty is preparing each FASTA file with a number as its name, e.g. 1.fa, 2.fa, 3.fa, 4.fa. I have collected some small but efficient scripts to do that.
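To make the goal concrete, here is a minimal sketch of the numbered-file layout that an array job expects. It is not one of the methods listed on this page; it uses plain `awk`, and the function name `split_fasta`, the directory names, and the demo sequences are assumptions for illustration:

```shell
#!/bin/bash
# Split a multi-FASTA file into one file per record, named 1.fa, 2.fa, ...
# so that task N of an array job can read N.fa via ${SGE_TASK_ID}.
split_fasta() {
  input="$1"
  outdir="$2"
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    # A header line starts a new record: close the previous file, open the next.
    /^>/ { if (out) close(out); n++; out = dir "/" n ".fa" }
    # Write header and sequence lines to the current record file.
    out  { print > out }
  ' "$input"
}

# Demo with two made-up records.
tmp_dir=$(mktemp -d)
printf '>geneA some description\nATGC\nGGCC\n>geneB\nTTAA\n' > "$tmp_dir/input.fasta"
split_fasta "$tmp_dir/input.fasta" "$tmp_dir/split"
ls "$tmp_dir/split"
```

Closing each file before opening the next keeps the number of open file handles low, which matters when the input holds thousands of genes.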
  
-  - Method one: using 'csplit' function
+    * Method one: using 'csplit' function
  
   <code>
Line 52: Line 104:
   # So in this case: -query /misc/scratch2/####/${SGE_TASK_ID}.fa
   #  will be renamed to
-  # -query /misc/scratch2/####/**0**{SGE_TASK_ID}
+  # -query /misc/scratch2/####/0{SGE_TASK_ID}
   # Technically, you can change 0 to whatever you want; it is just a file name prefix.
   </code>
Line 83: Line 135:
 for f in 0*
 do
- 
 python3 index_header_to_seq.py ####.fasta $f $f.fa
- 
 done
  
Line 91: Line 141:
 </code>
  
-  - Method two: Run shell script split.sh
+    * Method two: Run shell script split.sh
  
 <code>
Line 120: Line 170:
  
  
-But what if we change the code to this, the CPUs can be then efficiently used.
-
-<Last updated by Xi Zhang on Oct 8th,2021> Upcoming
+<Last updated by Xi Zhang on Oct 8th,2021>