ortologs_searches_using_panther_hmmrs

Differences

This shows you the differences between two versions of the page.

ortologs_searches_using_panther_hmmrs [2023/04/27 15:07] 134.190.232.186ortologs_searches_using_panther_hmmrs [2023/05/04 10:50] (current) – [Step 6: Parsing HMM outpus and creating inputs for a second HMM search] 134.190.232.186
Line 93: Line 93:
 For more information on species contained in Panther see the file ''PTHR_103classification.tsv'' or visit http://www.pantherdb.org/. For more information on species contained in Panther see the file ''PTHR_103classification.tsv'' or visit http://www.pantherdb.org/.
  
-   source activate python36-generic +<code> 
-   python BlastParser.by.Pident.py RNA.reduced.blastout +source activate python36-generic 
-   source deactivate+python BlastParser.by.Pident.py RNA.reduced.blastout 
 +source deactivate 
 +</code>
  
-The resulting file 'RNA.reduced.blastout_pparsed.tab' should contain one result by query. To check this, use the following commands:+The resulting file ''RNA.reduced.blastout_pparsed.tab'' should contain one result per query. To check this, use the following commands:
  
 To figure out how many sequences you started with: To figure out how many sequences you started with:
    grep '>' RNAreduced.seqs | wc -l    grep '>' RNAreduced.seqs | wc -l
-   outputs is -> 97 RNAreduced.seqs+   outputs is -> 97 RNAreduced.seqs 
 To figure out how many blast hits output you have: To figure out how many blast hits output you have:
    wc -l RNA.reduced.blastout    wc -l RNA.reduced.blastout
-   outputs is -> 99 RNA.reduced.blastout+   outputs is -> 99 RNA.reduced.blastout 
 +   
    wc -l RNA.reduced.blastout_pparsed.tab    wc -l RNA.reduced.blastout_pparsed.tab
-   outputs is -> 98 RNA.reduced.blastout_pparsed.tab +   outputs is -> 98 RNA.reduced.blastout_pparsed.tab 
-Note: Remember that this file has a header, so the total of blast results is 97, so everything is fine so far.+   
 +NOTE: This file has a header line, so the number of BLAST results is 97, which matches the number of queries. Everything is fine so far.
 If your number of queries and blast output results differ: 1) double check your input files for errors in format 2) go online to panther to check if the panther family for the query of your interest has been curated and/or exists. If your number of queries and blast output results differ: 1) double check your input files for errors in format 2) go online to panther to check if the panther family for the query of your interest has been curated and/or exists.
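
The count comparison above can be sketched as a small shell helper. This is only an illustration, not part of the original protocol: the function names are hypothetical, and it assumes the queries are in FASTA format and that the parsed table has exactly one header line.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of query sequences with the number of parsed BLAST hits.
# Assumption: the parsed table has exactly one header line and ends with a newline.

count_queries() {
    # Count FASTA records (lines that start with '>').
    grep -c '^>' "$1"
}

count_parsed_hits() {
    # Count data rows, skipping the single header line.
    tail -n +2 "$1" | wc -l | tr -d ' '
}

check_counts() {
    # $1: FASTA file with the queries, $2: parsed BLAST table
    local fasta="$1" parsed="$2"
    local n_queries n_hits
    n_queries=$(count_queries "$fasta")
    n_hits=$(count_parsed_hits "$parsed")
    if [ "$n_queries" -eq "$n_hits" ]; then
        echo "OK: $n_queries queries, $n_hits parsed hits"
    else
        echo "MISMATCH: $n_queries queries, $n_hits parsed hits"
        return 1
    fi
}

# Usage: check_counts RNAreduced.seqs RNA.reduced.blastout_pparsed.tab
```

A non-zero return value signals a mismatch, so the helper can also be used in a script that should stop before the later steps.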
  
-**Step 2:** Getting the panther codes +==== Step 2: Getting the PANTHER codes ==== 
-Obtain the codes for each panther super-family and subfamily for each query by using the command commands below and the PTHR_103classification.tsv to create a customized file for your queries.+ 
 +Obtain the codes for each PANTHER superfamily and subfamily for each query by using the commands below and ''PTHR_103classification.tsv'' to create a customized file for your queries.
  
 Post-processing the blast output to get panther information: Post-processing the blast output to get panther information:
 +
 1) Separate the first column, which corresponds to the query accession numbers: 1) Separate the first column, which corresponds to the query accession numbers:
    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f1 > queries_acc    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f1 > queries_acc
 +
 2) The uniprot accession numbers (these numbers will be used to grep PTHR_103classification.tsv): 2) The uniprot accession numbers (these numbers will be used to grep PTHR_103classification.tsv):
    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f2|cut -d '|' -f2 > hits_acc      cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f2|cut -d '|' -f2 > hits_acc  
 +
 3) Create a file containing both columns for future crosschecking: 3) Create a file containing both columns for future crosschecking:
    paste -d $'\t' queries_acc hits_acc > query_hits_columns    paste -d $'\t' queries_acc hits_acc > query_hits_columns
-4)  remove the header of query_hits_columns:+ 
 +4) Remove the header of query_hits_columns:
     sed -i '/query ID\tsubject ID/d' query_hits_columns     sed -i '/query ID\tsubject ID/d' query_hits_columns
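
As an alternative sketch, steps 1 to 4 can be collapsed into one awk pass that prints the query accession and the hit accession side by side. The function name is hypothetical, and two assumptions are made: the header is the first line of the parsed table, and the accession is the second `|`-separated field of the subject ID, as in step 2.

```shell
#!/usr/bin/env bash
# Sketch: build the headerless two-column table (query accession, hit accession)
# directly from the parsed BLAST table.
# Assumption: the header is line 1; the accession is field 2 of the '|'-separated subject ID.

make_query_hits() {
    # $1: parsed BLAST table (tab-separated, one header line)
    awk -F'\t' '
        NR > 1 {
            n = split($2, parts, "|")
            # Fall back to the whole subject ID when it contains no "|"
            acc = (n >= 2) ? parts[2] : $2
            print $1 "\t" acc
        }
    ' "$1"
}

# Usage: make_query_hits RNA.reduced.blastout_pparsed.tab > query_hits_columns
```

Unlike the sed command in step 4, this skips the header by position rather than by its text, so it does not depend on the exact column names.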
-Getting the information from panther classification: + 
-Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the PTHR_103classification.tsv:+Getting the information from PANTHER classification: 
 +Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the ''PTHR_103classification.tsv'': 
 5) Create a file containing the PANTHER information for the 97 queries: 5) Create a file containing the PANTHER information for the 97 queries:
    grep -w -F -f hits_acc /scratch3/rogerlab_databases/other_dbs/PTHR_103classification.tsv > Panther97queries_hit_info.tsv      grep -w -F -f hits_acc /scratch3/rogerlab_databases/other_dbs/PTHR_103classification.tsv > Panther97queries_hit_info.tsv  
 +
 6) Create a TSV file that identifies the PANTHER families by hits_acc: 6) Create a TSV file that identifies the PANTHER families by hits_acc:
    cut -d $'\t' -f1,2,3 Panther97queries_hit_info.tsv |cut -d '|' -f3,4|cut -d '=' -f2 > Pantherby97Uniprotaccession.tsv    cut -d $'\t' -f1,2,3 Panther97queries_hit_info.tsv |cut -d '|' -f3,4|cut -d '=' -f2 > Pantherby97Uniprotaccession.tsv
 +
 7) Eliminate extra tabs in the file: 7) Eliminate extra tabs in the file:
    sed -i.bak 's/\t\t/\t/g' Pantherby97Uniprotaccession.tsv    sed -i.bak 's/\t\t/\t/g' Pantherby97Uniprotaccession.tsv
    sed -i 's/:/_/g' Pantherby97Uniprotaccession.tsv    sed -i 's/:/_/g' Pantherby97Uniprotaccession.tsv
 +
 8) Creating a cheat sheet for you. Sort and Merge the headerless file ‘query_hits_columns’ with ‘Pantherby97Uniprotaccession.tsv’: 8) Creating a cheat sheet for you. Sort and Merge the headerless file ‘query_hits_columns’ with ‘Pantherby97Uniprotaccession.tsv’:
  
Line 153: Line 168:
  
  
-**Step 3:** Creating shells and running a HMMR search by **PANTHER SUBFAMILY**\\+==== Step 3: Creating shells and running a HMMR search by PANTHER SUBFAMILY ==== 
 1) You are ready to create the master shell for the hmmr search: 1) You are ready to create the master shell for the hmmr search:
    source activate python36-generic    source activate python36-generic
Line 171: Line 187:
        
  
-**Step 4:** Parsing HMM outputs and creating inputs for a second HMM search\\+==== Step 4: Parsing HMM outputs and creating inputs for a second HMM search ==== 
 Parsing the HMM outputs Parsing the HMM outputs
    source activate python36-generic    source activate python36-generic
Line 184: Line 201:
  
        
-**Step 5:**  Creating shells and running a HMMR search+==== Step 5:  Creating shells and running a HMMR search ====
  
 1) You are ready to create the master shell for the hmmr search by **PANTHER SUPERFAMILY**: 1) You are ready to create the master shell for the hmmr search by **PANTHER SUPERFAMILY**:
Line 195: Line 212:
    qsub Panther.HmmrSearch.sh2    qsub Panther.HmmrSearch.sh2
    
-**Step 6:** Parsing HMM outpus and creating inputs for a second HMM search+==== Step 6: Parsing HMM outputs and creating inputs for a second HMM search ==== 
 Parsing the HMM outputs Parsing the HMM outputs
    source activate python36-generic    source activate python36-generic
Line 210: Line 228:
    7) Create an input file (Input4ETE) to be later applied with the ETE_standAlone1.4.py script    7) Create an input file (Input4ETE) to be later applied with the ETE_standAlone1.4.py script
        
-**Step 6:** Start the tree search by submitting your jobs:+==== Step 7: Start the tree search by submitting your jobs ==== 
    ls -1 *Reconstruction.sh > list_of_shells    ls -1 *Reconstruction.sh > list_of_shells
    for i in `cat list_of_shells`; do qsub $i; done    for i in `cat list_of_shells`; do qsub $i; done
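
The submission loop above can also be written with a `while read` loop, which copes with unusual file names and skips entries that do not exist. This is a sketch only: the function name and the optional second argument (a submit command, useful for a dry run with `echo`) are hypothetical additions, and `qsub` is assumed to be on the PATH as in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: submit every job script listed in a file, one per line.
# Assumption: qsub is available; pass "echo" as the second argument for a dry run.

submit_shells() {
    # $1: file listing one job script per line
    # $2: submit command (defaults to qsub)
    local list="$1" submit="${2:-qsub}"
    local script
    while IFS= read -r script; do
        [ -n "$script" ] || continue               # skip empty lines
        if [ ! -f "$script" ]; then
            echo "skipping missing script: $script" >&2
            continue
        fi
        # Detach stdin so the submit command cannot consume the list itself
        "$submit" "$script" < /dev/null
    done < "$list"
}

# Usage:
#   ls -1 *Reconstruction.sh > list_of_shells
#   submit_shells list_of_shells echo    # dry run, prints the script names
#   submit_shells list_of_shells         # real submission with qsub
```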
        
-**Step 8:** Map protein domain architecture to each tree and build a pdf file by panther super-family+==== Step 8: Map protein domain architecture to each tree and build a pdf file by panther super-family ==== 
     1) Create a tab-separated input file with one record per line in this format: fastafile treefile     1) Create a tab-separated input file with one record per line in this format: fastafile treefile
     source activate python27-generic     source activate python27-generic
     xvfb-run -a python ETE_standAlone1.4.py Input4ETE     xvfb-run -a python ETE_standAlone1.4.py Input4ETE
     source deactivate     source deactivate
-**Step 9:** Creating a tabulated file to keep track of the findings. +     
 +==== Step 9: Creating a tabulated file to keep track of the findings. ==== 
 + 
    NOTE: You will need a metadata file. It may consist of accession numbers and a FASTA header. Please see the format of the metadata provided for this example, 'RNAreduced.METADATA'.    NOTE: You will need a metadata file. It may consist of accession numbers and a FASTA header. Please see the format of the metadata provided for this example, 'RNAreduced.METADATA'.
    source activate python36-generic    source activate python36-generic
Line 230: Line 252:
   Error: File existence/permissions problem in trying to open HMM file /db1/extra-data-sets/panther/PANTHER13.1/books/PTHR44316/hmmer.hmm.   Error: File existence/permissions problem in trying to open HMM file /db1/extra-data-sets/panther/PANTHER13.1/books/PTHR44316/hmmer.hmm.
  
-**Step 10:** Move the pdf files and 'MAIN_TABLE.txt' to your desktop for manual tree inspection and orthology assignment.+==== Step 10: Move the pdf files and 'MAIN_TABLE.txt' to your desktop for manual tree inspection and orthology assignment. ==== 
      
  
  
  
ortologs_searches_using_panther_hmmrs.1682618868.txt.gz · Last modified: by 134.190.232.186