ortologs_searches_using_panther_hmmrs

Differences

This shows you the differences between two versions of the page.

ortologs_searches_using_panther_hmmrs [2023/04/27 14:58] – [Detailed Protocol] 134.190.232.186ortologs_searches_using_panther_hmmrs [2023/05/04 10:50] (current) – [Step 6: Parsing HMM outpus and creating inputs for a second HMM search] 134.190.232.186
Line 1: Line 1:
-===== Orthology detection for metabolic pathways using genomic data and Panther data=====+====== Orthology detection for metabolic pathways using genomic data and Panther data ======
  
 //Prepared by D. E. Salas-Leiva// //Prepared by D. E. Salas-Leiva//
  
-==== Required software and scripts ====+===== Required software and scripts =====
  
   - ''ncbi blast'' and ''hmmer'' for homology searches. These are available at perun's environmental path   - ''ncbi blast'' and ''hmmer'' for homology searches. These are available at perun's environmental path
Line 15: Line 15:
       * ''ETE_standAlone1.4.py''      //uses python27-generic//       * ''ETE_standAlone1.4.py''      //uses python27-generic//
  
-==== Initial information required ====+===== Initial information required =====
  
-=== Query sequences in fasta format ===+==== Query sequences in fasta format ====
  
 These **MUST** belong to proteins with experimental evidence from **MODEL ORGANISMS**. This information should be gathered through literature searches and from specialized databases. Use query sequences from model organisms in which the pathway of interest has been very well studied. These **MUST** belong to proteins with experimental evidence from **MODEL ORGANISMS**. This information should be gathered through literature searches and from specialized databases. Use query sequences from model organisms in which the pathway of interest has been very well studied.
  
  
-=== Databases ===+==== Databases ====
  
 **Panther data**\\ **Panther data**\\
Line 52: Line 52:
 </code> </code>
  
-==== Protocol overview ====+===== Protocol overview =====
  
   - Blast queries (//experimentally characterized proteins of interest from model organisms//) against Uniprot103 database to get the best matching sequence   - Blast queries (//experimentally characterized proteins of interest from model organisms//) against Uniprot103 database to get the best matching sequence
Line 70: Line 70:
  
  
-==== Detailed Protocol ====+===== Detailed Protocol =====
  
 NOTE: The pathway of interest for this example is the **RNA decay pathway**, query file name is ''RNAreduced.seqs'' and Metadata associated is ''RNAreduced.METADATA'' NOTE: The pathway of interest for this example is the **RNA decay pathway**, query file name is ''RNAreduced.seqs'' and Metadata associated is ''RNAreduced.METADATA''
Line 84: Line 84:
 </code> </code>
  
-===  Step 1: Blast vs Uniprot103 ===+====  Step 1: Blast vs Uniprot103 ====
  
 Run blast search against UniProt103.fasta to find the closest Panther superfamily for each of the queries from model organisms and then parse the blast output by identity and e-value:\\ Run blast search against UniProt103.fasta to find the closest Panther superfamily for each of the queries from model organisms and then parse the blast output by identity and e-value:\\
Line 93: Line 93:
 For more information on species contained in Panther see the file ''PTHR_103classification.tsv'' or visit http://www.pantherdb.org/. For more information on species contained in Panther see the file ''PTHR_103classification.tsv'' or visit http://www.pantherdb.org/.
  
-   source activate python36-generic +<code> 
-   python BlastParser.by.Pident.py RNA.reduced.blastout +source activate python36-generic 
-   source deactivate+python BlastParser.by.Pident.py RNA.reduced.blastout 
 +source deactivate 
 +</code>
  
-The resulting file 'RNA.reduced.blastout_pparsed.tab' should contain one result per query. To check this, use the following commands:+The resulting file ''RNA.reduced.blastout_pparsed.tab'' should contain one result per query. To check this, use the following commands:
  
 To figure out how many sequences you started with: To figure out how many sequences you started with:
    grep '>' RNAreduced.seqs | wc -l    grep '>' RNAreduced.seqs | wc -l
-   output is -> 97+   output is -> 97 
 To figure out how many blast hits you have: To figure out how many blast hits you have:
    wc -l RNA.reduced.blastout    wc -l RNA.reduced.blastout
-   output is -> 99 RNA.reduced.blastout+   output is -> 99 RNA.reduced.blastout 
 +   
    wc -l RNA.reduced.blastout_pparsed.tab    wc -l RNA.reduced.blastout_pparsed.tab
-   output is -> 98 RNA.reduced.blastout_pparsed.tab +   output is -> 98 RNA.reduced.blastout_pparsed.tab 
-Note: Remember that this file has a header, so the number of blast results is 97 and everything is fine so far.+   
 +NOTE: Remember that this file has a header, so the number of blast results is 97 and everything is fine so far.
 If your number of queries and blast output results differ: 1) double check your input files for errors in format 2) go online to panther to check if the panther family for the query of your interest has been curated and/or exists. If your number of queries and blast output results differ: 1) double check your input files for errors in format 2) go online to panther to check if the panther family for the query of your interest has been curated and/or exists.
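
The count check above can also be scripted. The following is a minimal sketch on toy data, not part of the original protocol: the file names ''toy.seqs'' and ''toy_pparsed.tab'' and their contents are made up for illustration, and only the header text ''query ID<TAB>subject ID'' is taken from this page.

```shell
# Sketch on toy data; file names and contents are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy query file with three sequences.
printf '>q1\nMKT\n>q2\nMLS\n>q3\nMAA\n' > toy.seqs

# Toy parsed table: one header line plus one row per query.
printf 'query ID\tsubject ID\nq1\thitA\nq2\thitB\nq3\thitC\n' > toy_pparsed.tab

# Count FASTA headers (lines starting with ">").
n_queries=$(grep -c '^>' toy.seqs)

# Count result rows, skipping the header line.
n_results=$(awk 'NR > 1 { n++ } END { print n + 0 }' toy_pparsed.tab)

# Compare the two counts.
if [ "$n_queries" -eq "$n_results" ]; then
    status="OK"
else
    status="MISMATCH"
fi
echo "queries=$n_queries results=$n_results status=$status"
```

With the real files, ''toy.seqs'' would be replaced by ''RNAreduced.seqs'' and ''toy_pparsed.tab'' by ''RNA.reduced.blastout_pparsed.tab''.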
  
-**Step 2:** Getting the panther codes +==== Step 2: Getting the PANTHER codes ==== 
-Obtain the codes for each panther super-family and subfamily for each query by using the commands below and the PTHR_103classification.tsv to create a customized file for your queries.+ 
 +Obtain the codes for each PANTHER superfamily and subfamily for each query by using the commands below and ''PTHR_103classification.tsv'' to create a customized file for your queries.
  
 Post-processing the blast output to get panther information: Post-processing the blast output to get panther information:
 +
 1) Separate the first column, which corresponds to the query accession numbers: 1) Separate the first column, which corresponds to the query accession numbers:
    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f1 > queries_acc    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f1 > queries_acc
 +
 2) Extract the UniProt accession numbers (these numbers will be used to grep PTHR_103classification.tsv): 2) Extract the UniProt accession numbers (these numbers will be used to grep PTHR_103classification.tsv):
    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f2|cut -d '|' -f2 > hits_acc      cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f2|cut -d '|' -f2 > hits_acc  
 +
 3) Create a file containing both columns for future crosschecking: 3) Create a file containing both columns for future crosschecking:
    paste -d $'\t' queries_acc hits_acc > query_hits_columns    paste -d $'\t' queries_acc hits_acc > query_hits_columns
-4)  remove the header of query_hits_columns:+ 
 +4) Remove the header of query_hits_columns:
     sed -i '/query ID\tsubject ID/d' query_hits_columns     sed -i '/query ID\tsubject ID/d' query_hits_columns
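
Steps 1 to 4 above can be tried on a small table before running them on the real blast output. This is a sketch under stated assumptions: the toy file ''toy_pparsed.tab'', its third column, and the ''sp|ACCESSION|NAME'' layout of the subject ID are made up for illustration, and the header is removed with a portable ''sed 1d'' instead of the in-place pattern delete used above.

```shell
# Sketch on toy data; the table contents are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy stand-in for RNA.reduced.blastout_pparsed.tab (header + 2 rows).
printf 'query ID\tsubject ID\tidentity\n' > toy_pparsed.tab
printf 'Q1\tsp|P12345|AAA_HUMAN\t98.0\n' >> toy_pparsed.tab
printf 'Q2\ttr|Q67890|BBB_YEAST\t75.5\n' >> toy_pparsed.tab

# 1) Query accessions: first tab-separated column (tab is cut's default).
cut -f1 toy_pparsed.tab > queries_acc

# 2) Hit accessions: second column, then the second "|"-separated field.
cut -f2 toy_pparsed.tab | cut -d '|' -f2 > hits_acc

# 3) Paste both columns side by side (tab is paste's default delimiter).
paste queries_acc hits_acc > query_hits_columns

# 4) Drop the header row (first line) without in-place editing.
sed '1d' query_hits_columns > query_hits_columns.tmp
mv query_hits_columns.tmp query_hits_columns

cat query_hits_columns
```

Note that ''hits_acc'' still carries the header text on its first line, exactly as in the commands above; only ''query_hits_columns'' is made headerless.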
-Getting the information from panther classification: + 
-Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the PTHR_103classification.tsv:+Getting the information from PANTHER classification: 
 +Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the ''PTHR_103classification.tsv'': 
 5) Create a file containing the PANTHER information for the 97 queries: 5) Create a file containing the PANTHER information for the 97 queries:
    grep -w -F -f hits_acc /scratch3/rogerlab_databases/other_dbs/PTHR_103classification.tsv > Panther97queries_hit_info.tsv      grep -w -F -f hits_acc /scratch3/rogerlab_databases/other_dbs/PTHR_103classification.tsv > Panther97queries_hit_info.tsv  
 +
 6) Create a TSV file that identifies the PANTHER families by hits_acc: 6) Create a TSV file that identifies the PANTHER families by hits_acc:
    cut -d $'\t' -f1,2,3 Panther97queries_hit_info.tsv |cut -d '|' -f3,4|cut -d '=' -f2 > Pantherby97Uniprotaccession.tsv    cut -d $'\t' -f1,2,3 Panther97queries_hit_info.tsv |cut -d '|' -f3,4|cut -d '=' -f2 > Pantherby97Uniprotaccession.tsv
 +
 7) Eliminate extra tabs in the file: 7) Eliminate extra tabs in the file:
    sed -i.bak 's/\t\t/\t/g' Pantherby97Uniprotaccession.tsv    sed -i.bak 's/\t\t/\t/g' Pantherby97Uniprotaccession.tsv
    sed -i 's/:/_/g' Pantherby97Uniprotaccession.tsv    sed -i 's/:/_/g' Pantherby97Uniprotaccession.tsv
 +
 8) Create a cheat sheet: sort and merge the headerless file ''query_hits_columns'' with ''Pantherby97Uniprotaccession.tsv'': 8) Create a cheat sheet: sort and merge the headerless file ''query_hits_columns'' with ''Pantherby97Uniprotaccession.tsv'':
  
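
Steps 5 to 7 above (filtering the classification table by accession and reducing it to accession and family code) can be checked on toy data first. This is a sketch only: the three-column layout of ''toy_classification.tsv'' is an assumption inferred from the ''cut'' commands on this page, and all accessions and family codes are made up.

```shell
# Sketch on toy data; the classification layout is an assumption.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical classification rows: gene identifier, empty column, family code.
printf 'HUMAN|HGNC=1|UniProtKB=P12345\t\tPTHR10000:SF1\n' > toy_classification.tsv
printf 'YEAST|SGD=2|UniProtKB=Q67890\t\tPTHR20000:SF7\n' >> toy_classification.tsv
printf 'MOUSE|MGI=3|UniProtKB=A00001\t\tPTHR30000:SF2\n' >> toy_classification.tsv

# Accessions to look up, one per line.
printf 'P12345\nQ67890\n' > hits_acc

# 5) Keep rows containing any accession as a whole word (fixed strings).
grep -w -F -f hits_acc toy_classification.tsv > toy_hit_info.tsv

# 6) Reduce each row to the accession and the family code.
cut -f1,2,3 toy_hit_info.tsv | cut -d '|' -f3,4 | cut -d '=' -f2 > toy_by_accession.tsv

# 7) Collapse double tabs and replace ":" with "_".
tab=$(printf '\t')
sed "s/${tab}${tab}/${tab}/g; s/:/_/g" toy_by_accession.tsv > toy_by_accession.clean.tsv

cat toy_by_accession.clean.tsv
```

The ''-w'' flag matters here: it prevents a short accession from matching inside a longer one.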
Line 153: Line 168:
  
  
-**Step 3:** Creating shells and running a HMMR search by **PANTHER SUBFAMILY**\\+==== Step 3: Creating shells and running a HMMR search by PANTHER SUBFAMILY ==== 
 1) You are ready to create the master shell for the hmmr search: 1) You are ready to create the master shell for the hmmr search:
    source activate python36-generic    source activate python36-generic
Line 171: Line 187:
        
  
-**Step 4:** Parsing HMM outputs and creating inputs for a second HMM search\\+==== Step 4: Parsing HMM outputs and creating inputs for a second HMM search ==== 
 Parsing the HMM outputs Parsing the HMM outputs
    source activate python36-generic    source activate python36-generic
Line 184: Line 201:
  
        
-**Step 5:**  Creating shells and running a HMMR search+==== Step 5:  Creating shells and running a HMMR search ====
  
 1) You are ready to create the master shell for the hmmr search by **PANTHER SUPERFAMILY**: 1) You are ready to create the master shell for the hmmr search by **PANTHER SUPERFAMILY**:
Line 195: Line 212:
    qsub Panther.HmmrSearch.sh2    qsub Panther.HmmrSearch.sh2
    
-**Step 6:** Parsing HMM outpus and creating inputs for a second HMM search+==== Step 6: Parsing HMM outputs and creating inputs for a second HMM search ==== 
 Parsing the HMM outputs Parsing the HMM outputs
    source activate python36-generic    source activate python36-generic
Line 210: Line 228:
    7) Create an input file (Input4ETE) to be later applied with the ETE_standAlone1.4.py script    7) Create an input file (Input4ETE) to be later applied with the ETE_standAlone1.4.py script
        
-**Step 6:** Start the tree search by submitting your jobs:+==== Step 7: Start the tree search by submitting your jobs ==== 
    ls -1 *Reconstruction.sh > list_of_shells    ls -1 *Reconstruction.sh > list_of_shells
    for i in `cat list_of_shells`; do qsub $i; done    for i in `cat list_of_shells`; do qsub $i; done
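
Before submitting many jobs, it can help to preview what the loop would submit. The following is a sketch of a dry-run variant, not the original procedure: the ''SUBMIT'' variable and the two placeholder job scripts are hypothetical, and by default the loop only prints the ''qsub'' commands instead of running them.

```shell
# Dry-run sketch; the job scripts below are empty placeholders.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch famA.Reconstruction.sh famB.Reconstruction.sh

# Hypothetical switch: "echo qsub" prints the commands only.
# Setting SUBMIT=qsub in the environment would submit for real.
SUBMIT=${SUBMIT:-echo qsub}

submitted=0
for job in *Reconstruction.sh; do
    # Skip the literal pattern when no file matches.
    [ -e "$job" ] || continue
    $SUBMIT "$job"
    submitted=$((submitted + 1))
done
echo "jobs handled: $submitted"
```

Globbing directly over ''*Reconstruction.sh'' avoids the intermediate ''list_of_shells'' file and copes with file names that contain spaces.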
        
-**Step 8:** Map protein domain architecture to each tree and build a pdf file by panther super-family+==== Step 8: Map protein domain architecture to each tree and build a pdf file by panther super-family ==== 
     1) Create a tab-separated input file with one record per line, in this format: fastafile treefile     1) Create a tab-separated input file with one record per line, in this format: fastafile treefile
     source activate python27-generic     source activate python27-generic
     xvfb-run -a python ETE_standAlone1.4.py Input4ETE     xvfb-run -a python ETE_standAlone1.4.py Input4ETE
     source deactivate     source deactivate
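
If ''Input4ETE'' has to be written by hand, a loop such as the one below could build it. This is a sketch under assumptions: the ''.fasta'' and ''.treefile'' suffixes and the shared base name are hypothetical naming conventions, not something this protocol specifies, so they would need adjusting to the actual output names.

```shell
# Sketch with placeholder files; the suffixes are assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch PTHR10000.fasta PTHR10000.treefile PTHR20000.fasta

# Start with an empty list.
: > Input4ETE
for fasta in *.fasta; do
    [ -e "$fasta" ] || continue
    tree="${fasta%.fasta}.treefile"
    # Only list pairs where both the alignment and the tree exist.
    if [ -e "$tree" ]; then
        printf '%s\t%s\n' "$fasta" "$tree" >> Input4ETE
    fi
done

cat Input4ETE
```

Families without a finished tree are left out of the list, so the file only contains complete fastafile and treefile pairs.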
-**Step 9:** Creating a tabulated file to keep track of the findings. +     
 +==== Step 9: Creating a tabulated file to keep track of the findings. ==== 
 + 
   NOTE: You will need a metadata file. It may consist of accession numbers and a fasta header. Please see the format of the metadata provided for this example, 'RNAreduced.METADATA'   NOTE: You will need a metadata file. It may consist of accession numbers and a fasta header. Please see the format of the metadata provided for this example, 'RNAreduced.METADATA'
    source activate python36-generic    source activate python36-generic
Line 230: Line 252:
   Error: File existence/permissions problem in trying to open HMM file /db1/extra-data-sets/panther/PANTHER13.1/books/PTHR44316/hmmer.hmm.   Error: File existence/permissions problem in trying to open HMM file /db1/extra-data-sets/panther/PANTHER13.1/books/PTHR44316/hmmer.hmm.
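
The error above indicates that the HMM file for a family could not be opened. A pre-check like the following could list such families before any search is run. This is a sketch, not part of the original scripts: ''families.txt'' is a hypothetical list of PANTHER family IDs, and the toy ''books'' directory stands in for the real PANTHER books directory named in the error message.

```shell
# Sketch on a toy directory tree; families.txt is hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy stand-in for the PANTHER "books" directory, with one family present.
books="$workdir/books"
mkdir -p "$books/PTHR10000"
: > "$books/PTHR10000/hmmer.hmm"

# Families to be searched, one ID per line.
printf 'PTHR10000\nPTHR44316\n' > families.txt

# Record every family whose HMM file is missing or unreadable.
: > missing_hmms.txt
while IFS= read -r fam; do
    if [ ! -r "$books/$fam/hmmer.hmm" ]; then
        echo "$fam" >> missing_hmms.txt
    fi
done < families.txt

cat missing_hmms.txt
```

An empty ''missing_hmms.txt'' means every listed family has a readable ''hmmer.hmm'' file.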
  
-**Step 10:** Move the pdf files and 'MAIN_TABLE.txt' to your desktop for manual tree inspection and orthology assignment.+==== Step 10: Move the pdf files and 'MAIN_TABLE.txt' to your desktop for manual tree inspection and orthology assignment. ==== 
      
  
  
  
ortologs_searches_using_panther_hmmrs.1682618288.txt.gz · Last modified: by 134.190.232.186