ortologs_searches_using_panther_hmmrs

Differences

This shows you the differences between two versions of the page.

ortologs_searches_using_panther_hmmrs [2023/04/27 15:10] – [Step 1: Blast vs Uniprot103] 134.190.232.186
ortologs_searches_using_panther_hmmrs [2023/05/04 10:50] (current) – [Step 6: Parsing HMM outpus and creating inputs for a second HMM search] 134.190.232.186
Line 93: Line 93:
 For more information on species contained in Panther see the file ''PTHR_103classification.tsv'' or visit http://www.pantherdb.org/. For more information on species contained in Panther see the file ''PTHR_103classification.tsv'' or visit http://www.pantherdb.org/.
  
-   source activate python36-generic +<code> 
-   python BlastParser.by.Pident.py RNA.reduced.blastout +source activate python36-generic 
-   source deactivate+python BlastParser.by.Pident.py RNA.reduced.blastout 
 +source deactivate 
 +</code>
  
-The resulting file 'RNA.reduced.blastout_pparsed.tab' should contain one result by query. To check this, use the following commands:+The resulting file ''RNA.reduced.blastout_pparsed.tab'' should contain one result by query. To check this, use the following commands:
  
 To figure out how many sequences you started with: To figure out how many sequences you started with:
    grep '>' RNAreduced.seqs | wc -l    grep '>' RNAreduced.seqs | wc -l
-   outputs is -> 97 RNAreduced.seqs+   outputs is -> 97 RNAreduced.seqs 
 To figure out how many blast hits output you have: To figure out how many blast hits output you have:
    wc -l RNA.reduced.blastout    wc -l RNA.reduced.blastout
-   outputs is -> 99 RNA.reduced.blastout+   outputs is -> 99 RNA.reduced.blastout 
 +   
    wc -l RNA.reduced.blastout_pparsed.tab    wc -l RNA.reduced.blastout_pparsed.tab
-   outputs is -> 98 RNA.reduced.blastout_pparsed.tab +   outputs is -> 98 RNA.reduced.blastout_pparsed.tab 
-Note: Remember that this file has a header, so the total of blast results is 97, so everything is fine so far.+   
 +NOTE: Remember that this file has a header, so the total of blast results is 97, so everything is fine so far.
 If your number of queries and blast output results differ: 1) double check your input files for errors in format 2) go online to panther to check if the panther family for the query of your interest has been curated and/or exists. If your number of queries and blast output results differ: 1) double check your input files for errors in format 2) go online to panther to check if the panther family for the query of your interest has been curated and/or exists.
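The count comparison described above can also be scripted. The following is only a minimal sketch, assuming the file names used in the commands above and a one-line header in the parsed file; the function names ''count_queries'', ''count_parsed_hits'' and ''check_counts'' are hypothetical helpers and are not part of the original workflow.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of query sequences with the number of parsed
# BLAST results. Function names are hypothetical; adjust file names to your run.

count_queries() {
    # Count FASTA header lines (lines that start with '>').
    grep -c '^>' "$1" || true
}

count_parsed_hits() {
    # Count result lines, skipping the one-line header (assumed to be line 1).
    tail -n +2 "$1" | wc -l | tr -d ' '
}

check_counts() {
    local fasta="$1" parsed="$2"
    local n_queries n_hits
    n_queries=$(count_queries "$fasta")
    n_hits=$(count_parsed_hits "$parsed")
    if [ "$n_queries" -eq "$n_hits" ]; then
        echo "OK: $n_queries queries, $n_hits parsed hits"
    else
        echo "MISMATCH: $n_queries queries, $n_hits parsed hits"
        return 1
    fi
}

# Example call; it only runs if both files from this step exist.
if [ -f RNAreduced.seqs ] && [ -f RNA.reduced.blastout_pparsed.tab ]; then
    check_counts RNAreduced.seqs RNA.reduced.blastout_pparsed.tab
fi
```

A mismatch only tells you that the counts differ; the two checks listed above (input format and PANTHER curation status) are still needed to find the cause.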
  
-==== Step 2: Getting the panther codes ====+==== Step 2: Getting the PANTHER codes ====
  
-Obtain the codes for each panther super-family and subfamily for each query by using the commands below and the PTHR_103classification.tsv to create a customized file for your queries.+Obtain the codes for each PANTHER superfamily and subfamily for each query by using the commands below and the ''PTHR_103classification.tsv'' to create a customized file for your queries.
  
 Post-processing the blast output to get panther information: Post-processing the blast output to get panther information:
 +
 1) Separate the first column, which corresponds to the query accession numbers: 1) Separate the first column, which corresponds to the query accession numbers:
    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f1 > queries_acc    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f1 > queries_acc
 +
 2) The uniprot accession numbers (these numbers will be used to grep PTHR_103classification.tsv): 2) The uniprot accession numbers (these numbers will be used to grep PTHR_103classification.tsv):
    cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f2|cut -d '|' -f2 > hits_acc      cut -d $'\t' RNA.reduced.blastout_pparsed.tab -f2|cut -d '|' -f2 > hits_acc  
 +
 3) Create a file containing both columns for future crosschecking: 3) Create a file containing both columns for future crosschecking:
    paste -d $'\t' queries_acc hits_acc > query_hits_columns    paste -d $'\t' queries_acc hits_acc > query_hits_columns
-4)  remove the header of query_hits_columns:+ 
 +4) Remove the header of query_hits_columns:
     sed -i '/query ID\tsubject ID/d' query_hits_columns     sed -i '/query ID\tsubject ID/d' query_hits_columns
-Getting the information from panther classification: + 
-Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the PTHR_103classification.tsv:+Getting the information from PANTHER classification: 
 +Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the ''PTHR_103classification.tsv'': 
 5) Create a file containing the panther information for the 97 queries: 5) Create a file containing the panther information for the 97 queries:
    grep -w -F -f hits_acc /scratch3/rogerlab_databases/other_dbs/PTHR_103classification.tsv > Panther97queries_hit_info.tsv      grep -w -F -f hits_acc /scratch3/rogerlab_databases/other_dbs/PTHR_103classification.tsv > Panther97queries_hit_info.tsv  
 +
 6) Create a tsv file that identifies the panther families by hits_acc: 6) Create a tsv file that identifies the panther families by hits_acc:
    cut -d $'\t' -f1,2,3 Panther97queries_hit_info.tsv |cut -d '|' -f3,4|cut -d '=' -f2 > Pantherby97Uniprotaccession.tsv    cut -d $'\t' -f1,2,3 Panther97queries_hit_info.tsv |cut -d '|' -f3,4|cut -d '=' -f2 > Pantherby97Uniprotaccession.tsv
 +
 7) eliminate extra tabulations in the file: 7) eliminate extra tabulations in the file:
    sed -i.bak 's/\t\t/\t/g' Pantherby97Uniprotaccession.tsv    sed -i.bak 's/\t\t/\t/g' Pantherby97Uniprotaccession.tsv
    sed -i 's/:/_/g' Pantherby97Uniprotaccession.tsv    sed -i 's/:/_/g' Pantherby97Uniprotaccession.tsv
 +
 8) Creating a cheat sheet for you. Sort and Merge the headerless file ‘query_hits_columns’ with ‘Pantherby97Uniprotaccession.tsv’: 8) Creating a cheat sheet for you. Sort and Merge the headerless file ‘query_hits_columns’ with ‘Pantherby97Uniprotaccession.tsv’:
  
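As an aside, steps 1 to 4 above (cut, cut, paste, then removing the header) can be collapsed into a single pass. This is only a sketch under stated assumptions: the header is the first line of the parsed file, and the second column holds UniProt-style identifiers such as ''sp|ACCESSION|NAME''. The function name ''extract_query_hit_pairs'' is hypothetical and does not appear in the original page.

```shell
#!/usr/bin/env bash
# Sketch: build the two-column query/hit table in one awk pass.
# Assumes line 1 is the header and column 2 looks like db|ACCESSION|NAME.

extract_query_hit_pairs() {
    awk -F'\t' 'NR > 1 {
        # Split the subject ID on "|" and keep the accession (second field).
        n = split($2, parts, "|")
        # If there is no "|" in the field, keep it unchanged (as cut would).
        acc = (n >= 2) ? parts[2] : $2
        print $1 "\t" acc
    }' "$1"
}

# Example call; it only runs if the parsed BLAST file exists.
if [ -f RNA.reduced.blastout_pparsed.tab ]; then
    extract_query_hit_pairs RNA.reduced.blastout_pparsed.tab > query_hits_columns
fi
```

Unlike the step-by-step commands, this variant does not produce the separate ''queries_acc'' and ''hits_acc'' files; ''hits_acc'' is still needed for the grep in step 5 and could be obtained with ''cut -f2 query_hits_columns''.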
Line 173: Line 187:
        
  
-==== Step 4:** Parsing HMM outputs and creating inputs for a second HMM search ====+==== Step 4: Parsing HMM outputs and creating inputs for a second HMM search ====
  
 Parsing the HMM outputs Parsing the HMM outputs
Line 198: Line 212:
    qsub Panther.HmmrSearch.sh2    qsub Panther.HmmrSearch.sh2
    
-==== Step 6: Parsing HMM outpus and creating inputs for a second HMM search ====+==== Step 6: Parsing HMM outputs and creating inputs for a second HMM search ====
  
 Parsing the HMM outputs Parsing the HMM outputs
Line 214: Line 228:
    7) Create an input file (Input4ETE) to be later applied with the ETE_standAlone1.4.py script    7) Create an input file (Input4ETE) to be later applied with the ETE_standAlone1.4.py script
        
-==== Step 6: Start the tree search by submitting your jobs ====+==== Step 7: Start the tree search by submitting your jobs ====
  
    ls -1 *Reconstruction.sh > list_of_shells    ls -1 *Reconstruction.sh > list_of_shells
ortologs_searches_using_panther_hmmrs.1682619015.txt.gz · Last modified: by 134.190.232.186