====== Orthology detection for metabolic pathways using genomic data and Panther data ======
  
//Prepared by D. E. Salas-Leiva//
  
===== Required software and scripts =====

  * ''ncbi blast'' and ''hmmer'' for homology searches. These are available at perun's environmental path.
  * ''mafft'', ''trimal'', ''fasttree'' for tree reconstruction. These are available at perun's environmental path or at ''/home/dsalas/Shared/''.
  * the following python scripts written by Dayana, available at ''/home/dsalas/Shared/RNAdecayCarp/Scripts'':
    * ''Blastp_search.sh''
    * ''BlastParser.by.Pident.py''  //uses python36-generic//
    * ''ShellMaker2.py''  //uses python36-generic//
    * ''Tsv_parser.py''  //uses python36-generic//
    * ''SeqCollector.py''  //uses python36-generic//
    * ''ETE_standAlone1.4.py''  //uses python27-generic//

===== Initial information required =====

==== Query sequences in fasta format ====
  
These **MUST** belong to proteins with experimental evidence from **MODEL ORGANISMS** in which the pathway of interest has been very well studied. This information should be gathered through literature searches, as well as by downloading it from specialized databases.
  
  
==== Databases ====
  
**Panther data**\\
For the current workflow, 103 proteomes were gathered and reparsed from Uniprot (covering Archaea, Bacteria and Eukaryota). This data is organized as:
  
  * A full proteome blast-able database at: ''/scratch3/rogerlab_databases/other_dbs/UniProt103.fasta''
  * A full panther database containing hmm profiles by panther superfamily at: ''/db1/extra-data-sets/panther/PANTHER13.1/books/''
  * A full panther database containing fasta files by panther superfamily at: ''/scratch3/rogerlab_databases/other_dbs/UniProt103_byPTHRfam''
**Pfam-A**\\
hmmer database at: ''/scratch3/rogerlab_databases/other_dbs/Pfam-A.hmm''

**Predicted proteomes**\\
The directory containing the predicted proteomes is called ''Pred_Proteomes_playground''.\\
For the //Carpediemonas membranifera// paper these are the predicted proteomes:

<code>
Carp_sept19.aa.fasta
Cfrisia.masked.aa.fasta
GinA_50803.aa.fasta
GinB_50581.aa.fasta
Gmuris.aa.fasta
Kipferlia.aa.fasta
Monoc.aa.fasta
SSAL.aa.fasta
Trepo.aa.fasta
Trich.aa.fasta
</code>
 + 
===== Protocol overview =====

  - Blast queries (//experimentally characterized proteins of interest from model organisms//) against the Uniprot103 database to get the best matching sequence
  - Obtain the PANTHER classification for that sequence (PANTHER **superfamily** and **subfamily**)
  - Scan the predicted proteomes using the hmm-profile corresponding to the **subfamily**
  - Parse the hmm output: empty files and files with results
  - Redo the scan on the predicted proteomes, this time using the hmm-profile corresponding to the **superfamily**
  - Parse the hmm output: empty files and files with results
  - Concatenate the parsed results from steps 4 and 6
  - Retrieve the listed sequences from each predicted proteome by PANTHER **superfamily**
  - Create a working fasta file with each of the candidates and the PANTHER files from UniProt103_byPTHRfam (see above)
  - Create shells for sequence alignment, trimming and tree search
  - Tree search
  - Map protein domain architecture to each tree and build a pdf file by PANTHER **superfamily**
  - Create a tabulated file to keep track of the findings
  - Move the pdf files to your desktop for tree inspection and orthology assignment.
===== Detailed Protocol =====
  
NOTE: The pathway of interest for this example is the **RNA decay pathway**; the query file is ''RNAreduced.seqs'' and the associated metadata file is ''RNAreduced.METADATA''.
  
-**Detailed Protocol:**\\ +BEFORE YOU STARTMake a working directory, copy the predicted proteomesqueries and metadata, and the scripts for this workflow:
-Note: The pathway of interest for this example is the RNA decay pathwayquery file name is RNAreduced.seqs and Metadata associated is RNAreduced.METADATA\\+
  
-**BEFORE TO START: make a working directory, copy the predicted proteomes, queries and metadata, and the scripts for this workflow:** \\ +<code> 
-   mkdir Pred_Proteomes_playground +mkdir Pred_Proteomes_playground 
-   cp Queries/* Pred_Proteomes_playground/ +cp Queries/* Pred_Proteomes_playground/ 
-   cp Comparative_Genomics/* Pred_Proteomes_playground/ +cp Comparative_Genomics/* Pred_Proteomes_playground/ 
-   cp *.py Blastp*.sh Pred_Proteomes_playground/ +cp *.py Blastp*.sh Pred_Proteomes_playground/ 
-   cd Pred_Proteomes_playground+cd Pred_Proteomes_playground 
 +</code>
  
==== Step 1: Blast vs Uniprot103 ====
  
Run a blast search against ''UniProt103.fasta'' to find the closest Panther superfamily for each of the queries from model organisms, and then parse the blast output by identity and e-value.\\
The query file name is **RNAreduced.seqs**, the blast output will be called **RNA.reduced.blastout** and the shell is **Blastp_search.sh**
  
Parse the blast output by identity and e-value:\\
Given the queries you are using (see //Initial information required//), it is highly likely that these sequences will find themselves during the blast search against ''UniProt103.fasta''.
For more information on the species contained in Panther, see the file ''PTHR_103classification.tsv'' or visit http://www.pantherdb.org/.
  
<code>
source activate python36-generic
python BlastParser.by.Pident.py RNA.reduced.blastout
source deactivate
</code>
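The parser's exact criteria live in ''BlastParser.by.Pident.py''; as a rough, hypothetical sketch of this kind of identity/e-value filtering on tabular blast output (the thresholds below are invented for illustration, not the script's real ones):

```shell
# Hypothetical sketch, NOT the actual BlastParser.by.Pident.py logic:
# keep tabular (outfmt 6) blast hits with pident >= 30 and evalue <= 1e-5.
# Fabricated two-line example standing in for RNA.reduced.blastout:
printf 'q1\tsp|P1|X\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n'  > example.blastout
printf 'q2\tsp|P2|Y\t20.0\t100\t80\t0\t1\t100\t1\t100\t1e-03\t50\n' >> example.blastout
# Column 3 is percent identity, column 11 is the e-value.
awk -F '\t' '$3 >= 30 && $11 <= 1e-5' example.blastout > example_pparsed.tab
cat example_pparsed.tab    # only the q1 line survives the filter
```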
  
The resulting file ''RNA.reduced.blastout_pparsed.tab'' should contain one result per query. To check this, use the following commands:
  
To figure out how many sequences you started with:
   grep '>' RNAreduced.seqs | wc -l
   output is -> 97
To figure out how many blast hit lines you have:
   wc -l RNA.reduced.blastout
   output is -> 99 RNA.reduced.blastout

   wc -l RNA.reduced.blastout_pparsed.tab
   output is -> 98 RNA.reduced.blastout_pparsed.tab

NOTE: Remember that this file has a header, so the total of blast results is 97; everything is fine so far.
If your number of queries and blast output results differ: 1) double check your input files for format errors; 2) go online to Panther to check whether the panther family for your query of interest has been curated and/or exists.
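To automate that header-aware comparison, a small helper could be used (a hedged sketch; the file names below are fabricated stand-ins for ''RNAreduced.seqs'' and the parsed table):

```shell
# Hypothetical count check: queries in the fasta vs data rows in the
# parsed blast table (which carries a one-line header, hence the "- 1").
printf '>q1\nMSEQ\n>q2\nMSEQ\n' > example2.seqs                       # fabricated
printf 'query ID\tsubject ID\nq1\ts1\nq2\ts2\n' > example2_pparsed.tab # fabricated
queries=$(grep -c '>' example2.seqs)
hits=$(( $(wc -l < example2_pparsed.tab) - 1 ))
if [ "$queries" -ne "$hits" ]; then
  echo "WARNING: $queries queries but $hits parsed hits"
else
  echo "OK: $queries queries, $hits parsed hits"
fi
```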
  
==== Step 2: Getting the PANTHER codes ====

Obtain the codes for each PANTHER superfamily and subfamily for each query by using the commands below and ''PTHR_103classification.tsv'' to create a customized file for your queries.
  
Post-processing the blast output to get panther information:

1) Separate the first column, which corresponds to the query accession numbers:
   cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f1 > queries_acc

2) The uniprot accession numbers (these will be used to grep PTHR_103classification.tsv):
   cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f2 | cut -d '|' -f2 > hits_acc

3) Create a file containing both columns for future crosschecking:
   paste -d $'\t' queries_acc hits_acc > query_hits_columns

4) Remove the header of query_hits_columns:
   sed -i '/query ID\tsubject ID/d' query_hits_columns

Getting the information from the PANTHER classification:
Now create a file containing the panther information for only the 97 queries by grepping the hits_acc entries from ''PTHR_103classification.tsv'':

5) Create a file containing the panther information for the 97 queries:
   grep -w -F -f hits_acc /scratch3/rogerlab_databases/other_dbs/PTHR_103classification.tsv > Panther97queries_hit_info.tsv

6) Create a tsv file that identifies the panther families by hits_acc:
   cut -d $'\t' -f1,2,3 Panther97queries_hit_info.tsv | cut -d '|' -f3,4 | cut -d '=' -f2 > Pantherby97Uniprotaccession.tsv

7) Eliminate extra tabulations in the file:
   sed -i.bak 's/\t\t/\t/g' Pantherby97Uniprotaccession.tsv
   sed -i 's/:/_/g' Pantherby97Uniprotaccession.tsv

8) Create a cheat sheet for yourself. Sort and merge the headerless file ''query_hits_columns'' with ''Pantherby97Uniprotaccession.tsv'':
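The exact merge commands are not shown in this revision of the page; a minimal sketch with ''sort'' and ''join'' (assuming the Uniprot accession is column 2 of ''query_hits_columns'' and column 1 of the panther tsv — the two-row data below is fabricated, so skip the two printf lines when working on real files):

```shell
# Hypothetical sort-and-join sketch; real accessions/families will differ.
printf 'Q001\tP12345\nQ002\tP67890\n' > query_hits_columns               # fabricated
printf 'P12345\tPTHR10045_SF3\nP67890\tPTHR11222_SF1\n' > Pantherby97Uniprotaccession.tsv
TAB=$(printf '\t')
sort -t "$TAB" -k2,2 query_hits_columns > qh.sorted
sort -t "$TAB" -k1,1 Pantherby97Uniprotaccession.tsv > pthr.sorted
# Join on the shared Uniprot accession; output: accession, query, panther family.
join -t "$TAB" -1 2 -2 1 qh.sorted pthr.sorted > Query_PTHR_cheatsheet.tsv
```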
  
  
  
==== Step 3: Creating shells and running a HMMER search by PANTHER SUBFAMILY ====

1) You are ready to create the master shell for the hmmer search:
   source activate python36-generic
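The ''ShellMaker2.py'' invocation itself automates building the master shell; as a hedged sketch of the kind of file it produces, one ''hmmsearch'' line per proteome/subfamily pair can be emitted like this (''subfamily_codes.txt'', the example files, and the subfamily book path are invented for illustration — the real script's inputs and naming differ):

```shell
# Hypothetical sketch of what ShellMaker2.py automates, NOT the real script.
BOOKS=/db1/extra-data-sets/panther/PANTHER13.1/books
printf 'PTHR10045.SF3\n' > subfamily_codes.txt     # invented subfamily list
printf '>p1\nMSEQ\n' > Example.aa.fasta            # invented proteome
: > Panther.HmmrSearch.sh
for proteome in *.aa.fasta; do
  while IFS= read -r sf; do
    # one hmmsearch command per proteome x subfamily profile
    echo "hmmsearch --tblout ${proteome%.aa.fasta}.${sf}.out $BOOKS/$sf/hmmer.hmm $proteome" >> Panther.HmmrSearch.sh
  done < subfamily_codes.txt
done
```

The generated shell is then submitted with qsub, as in the steps below; hmmsearch is assumed to be on the cluster's PATH.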
        
  
==== Step 4: Parsing HMM outputs and creating inputs for a second HMM search ====

Parsing the HMM outputs:
   source activate python36-generic
  
        
==== Step 5: Creating shells and running a HMMER search by PANTHER SUPERFAMILY ====

1) You are ready to create the master shell for the hmmer search by **PANTHER SUPERFAMILY**:
The script will produce a file called ''Panther.HmmrSearch.sh2'' that contains all the command lines required to search the queries that did not have hits during the search carried out in step 3.

2) Qsub the shell; please avoid lower-RAM nodes:
   qsub Panther.HmmrSearch.sh2
    
==== Step 6: Parsing HMM outputs and creating inputs for a second HMM search ====

Parsing the HMM outputs:
   source activate python36-generic
   7) Create an input file (Input4ETE) to be used later with the ETE_standAlone1.4.py script
        
==== Step 7: Start the tree search by submitting your jobs ====

   ls -1 *Reconstruction.sh > list_of_shells
   for i in `cat list_of_shells`; do qsub $i; done
        
==== Step 8: Map protein domain architecture to each tree and build a pdf file by PANTHER superfamily ====

1) Create a tab-separated input file containing one record per line in this format: fastafile treefile
   source activate python27-generic
   xvfb-run -a python ETE_standAlone1.4.py Input4ETE
   source deactivate
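The Input4ETE file from step 1 above could be assembled with a loop like the following (a hedged sketch: it assumes each working fasta ''X.fas'' has a matching tree named ''X.fas.treefile'', which may not match your real naming; the demo pair is fabricated):

```shell
# Hypothetical Input4ETE builder; the fabricated pair below is for demo only.
printf '>a\nMSEQ\n' > PTHR10045.fas
printf '(a,b);\n' > PTHR10045.fas.treefile
: > Input4ETE
for f in *.fas; do
  # one tab-separated "fastafile treefile" record per line
  [ -e "$f.treefile" ] && printf '%s\t%s\n' "$f" "$f.treefile" >> Input4ETE
done
```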

==== Step 9: Creating a tabulated file to keep track of the findings ====

NOTE: You will need a metadata file. It may consist of accession numbers and a fasta header; see the format of the metadata provided for this example, ''RNAreduced.METADATA''.
   source activate python36-generic
   source deactivate
  
**DON'T FORGET to check how many searches were not done due to errors:**

  grep 'Error:' Panther.HmmrSearch.sh2.e* | sort -u

Example output:

  Error: File existence/permissions problem in trying to open HMM file /db1/extra-data-sets/panther/PANTHER13.1/books/PTHR44316/hmmer.hmm.

==== Step 10: Move the pdf files and 'MAIN_TABLE.txt' to your desktop for manual tree inspection and orthology assignment ====
      
  
  
  