ortologs_searches_using_panther_hmmrs
Differences
This shows you the differences between two versions of the page.
| ortologs_searches_using_panther_hmmrs [2023/04/27 14:46] – 134.190.232.186 | ortologs_searches_using_panther_hmmrs [2023/05/04 10:50] (current) – [Step 6: Parsing HMM outpus and creating inputs for a second HMM search] 134.190.232.186 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ===== Orthology detection for metabolic pathways using genomic data and Panther data===== | + | ====== Orthology detection for metabolic pathways using genomic data and Panther data ====== |
| // | // | ||
| - | ==== Required softwares and scripts ==== | + | ===== Required softwares and scripts |
| - '' | - '' | ||
| Line 15: | Line 15: | ||
| * '' | * '' | ||
| - | ==== Initial information required ==== | + | ===== Initial information required |
| - | === Query sequences in fasta format === | + | ==== Query sequences in fasta format |
| These **MUST** belong to proteins with experimental evidence from **MODEL ORGANISMS**. This information should be gathered through literature searches and by downloading data from specialized databases. Use query sequences from model organisms in which the pathway of interest has been very well studied. | These **MUST** belong to proteins with experimental evidence from **MODEL ORGANISMS**. This information should be gathered through literature searches and by downloading data from specialized databases. Use query sequences from model organisms in which the pathway of interest has been very well studied. | ||
| - | === Databases === | + | ==== Databases |
| **Panther data**\\ | **Panther data**\\ | ||
| Line 52: | Line 52: | ||
| </ | </ | ||
| - | ==== Protocol overview ==== | + | ===== Protocol overview |
| - Blast queries (// | - Blast queries (// | ||
| Line 70: | Line 70: | ||
| - | ==== Detailed Protocol ==== | + | ===== Detailed Protocol |
| - | NOTE: The pathway of interest for this example is the RNA decay pathway, query file name is RNAreduced.seqs and Metadata associated is RNAreduced.METADATA\\ | + | NOTE: The pathway of interest for this example is the **RNA decay pathway**, query file name is '' |
| - | BEFORE YOU START:\\ | + | BEFORE YOU START: Make a working directory, copy the predicted proteomes, queries and metadata, and the scripts for this workflow: |
| - | Make a working directory, copy the predicted proteomes, queries and metadata, and the scripts for this workflow: | + | |
| < | < | ||
| Line 85: | Line 84: | ||
| </ | </ | ||
| - | === Step 1: Blast vs Uniprot103 === | + | ==== Step 1: Blast vs Uniprot103 |
| Run blast search against UniProt103.fasta to find the closest Panther superfamily for each of the queries from model organisms and then parse the blast output by identity and e-value:\\ | Run blast search against UniProt103.fasta to find the closest Panther superfamily for each of the queries from model organisms and then parse the blast output by identity and e-value:\\ | ||
| - | The query file name is **RNAreduced.seqs**, | + | The query file name is **RNAreduced.seqs**, |
| Parse the blast output by identity and e-value: | Parse the blast output by identity and e-value: | ||
| - | Given the queries you are using (see Initial information required) it is highly likely that these sequences will find themselves during the blast search against UniProt103.fasta | + | Given the queries you are using (see //Initial information required//) it is highly likely that these sequences will find themselves during the blast search against |
| - | [For more information on species contained in Panther see the file PTHR_103classification.tsv or visit http:// | + | For more information on species contained in Panther see the file '' |
| - | source activate python36-generic | + | < |
| - | | + | source activate python36-generic |
| - | | + | python BlastParser.by.Pident.py RNA.reduced.blastout |
| + | source deactivate | ||
| + | </ | ||
| - | The resulting file ' | + | The resulting file '' |
| To figure out how many sequences you started with: | To figure out how many sequences you started with: | ||
| grep '>' RNAreduced.seqs | wc -l | grep '>' RNAreduced.seqs | wc -l | ||
| - | | + | # output is -> 97 |
| To figure out how many blast hits output you have: | To figure out how many blast hits output you have: | ||
| wc -l RNA.reduced.blastout | wc -l RNA.reduced.blastout | ||
| - | | + | # output is -> 99 RNA.reduced.blastout |
| + | |||
| wc -l RNA.reduced.blastout_pparsed.tab | wc -l RNA.reduced.blastout_pparsed.tab | ||
| - | | + | # output is -> 98 RNA.reduced.blastout_pparsed.tab |
| - | Note: Remember that this file has a header, so the total of blast results is 97, so everything is fine so far. | + | |
| + | NOTE: Remember that this file has a header line, so the total number of blast results is 97, which matches the number of queries. | ||
| If the number of queries and the number of blast results differ: 1) double-check your input files for formatting errors; 2) check the PANTHER website to see whether the PANTHER family for your query of interest exists and/or has been curated. | If the number of queries and the number of blast results differ: 1) double-check your input files for formatting errors; 2) check the PANTHER website to see whether the PANTHER family for your query of interest exists and/or has been curated. | ||
| - | **Step 2:** Getting the panther | + | ==== Step 2: Getting the PANTHER |
| - | Obtain the codes for each panther super-family | + | |
| + | Obtain the codes for each PANTHER superfamily | ||
| Post-processing the blast output to get panther information: | Post-processing the blast output to get panther information: | ||
| + | |||
| 1) Separate the first column, which corresponds to the query accession numbers: | 1) Separate the first column, which corresponds to the query accession numbers: | ||
| cut -d $' | cut -d $' | ||
| + | |||
| 2) Extract the UniProt accession numbers (these numbers will be used to grep PTHR_103classification.tsv): | 2) Extract the UniProt accession numbers (these numbers will be used to grep PTHR_103classification.tsv): | ||
| cut -d $' | cut -d $' | ||
| + | |||
| 3) Create a file containing both columns for future crosschecking: | 3) Create a file containing both columns for future crosschecking: | ||
| paste -d $' | paste -d $' | ||
| - | 4) | + | |
| + | 4) Remove | ||
| sed -i '/ | sed -i '/ | ||
| - | Getting the information from panther | + | |
| - | Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the PTHR_103classification.tsv: | + | Getting the information from PANTHER |
| + | Now, create a file containing the panther information only for 97 queries, by grepping the hits_acc information from the '' | ||
| 5) Create a file containing the panther information for the 97 queries: | 5) Create a file containing the panther information for the 97 queries: | ||
| grep -w -F -f hits_acc / | grep -w -F -f hits_acc / | ||
| + | |||
| 6) Create a tsv file identifying the panther families by hits_acc: | 6) Create a tsv file identifying the panther families by hits_acc: | ||
| cut -d $' | cut -d $' | ||
| + | |||
| 7) Eliminate extra tabs in the file: | 7) Eliminate extra tabs in the file: | ||
| sed -i.bak ' | sed -i.bak ' | ||
| sed -i ' | sed -i ' | ||
| + | |||
| 8) Creating a cheat sheet for you. Sort and Merge the headerless file ‘query_hits_columns’ with ‘Pantherby97Uniprotaccession.tsv’: | 8) Creating a cheat sheet for you. Sort and Merge the headerless file ‘query_hits_columns’ with ‘Pantherby97Uniprotaccession.tsv’: | ||
| Line 154: | Line 168: | ||
| - | **Step 3:** Creating shells and running a HMMR search by **PANTHER SUBFAMILY**\\ | + | ==== Step 3: Creating shells and running a HMMR search by PANTHER SUBFAMILY |
| 1) You are ready to create the master shell for the hmmr search: | 1) You are ready to create the master shell for the hmmr search: | ||
| | | ||
| Line 172: | Line 187: | ||
| - | **Step 4:** Parsing HMM outputs and creating inputs for a second HMM search\\ | + | ==== Step 4: Parsing HMM outputs and creating inputs for a second HMM search |
| Parsing the HMM outputs | Parsing the HMM outputs | ||
| | | ||
| Line 185: | Line 201: | ||
| - | **Step 5:** | + | ==== Step 5: Creating shells and running a HMMR search |
| 1) You are ready to create the master shell for the hmmr search by **PANTHER SUPERFAMILY**: | 1) You are ready to create the master shell for the hmmr search by **PANTHER SUPERFAMILY**: | ||
| Line 196: | Line 212: | ||
| qsub Panther.HmmrSearch.sh2 | qsub Panther.HmmrSearch.sh2 | ||
| - | **Step 6:** Parsing HMM outpus | + | ==== Step 6: Parsing HMM outputs |
| Parsing the HMM outputs | Parsing the HMM outputs | ||
| | | ||
| Line 211: | Line 228: | ||
| 7) Create an input file (Input4ETE) to be used later with the ETE_standAlone1.4.py script | 7) Create an input file (Input4ETE) to be used later with the ETE_standAlone1.4.py script | ||
| - | **Step 6:** Start the tree search by submitting your jobs: | + | ==== Step 7: Start the tree search by submitting your jobs ==== |
| ls -1 *Reconstruction.sh > list_of_shells | ls -1 *Reconstruction.sh > list_of_shells | ||
| for i in `cat list_of_shells`; | for i in `cat list_of_shells`; | ||
| - | **Step 8:** Map protein domain architecture to each tree and build a pdf file by panther super-family | + | ==== Step 8: Map protein domain architecture to each tree and build a pdf file by panther super-family |
| 1) Create a tab-separated input file with one record per line, in this format: fastafile treefile | 1) Create a tab-separated input file with one record per line, in this format: fastafile treefile | ||
| source activate python27-generic | source activate python27-generic | ||
| xvfb-run -a python ETE_standAlone1.4.py Input4ETE | xvfb-run -a python ETE_standAlone1.4.py Input4ETE | ||
| source deactivate | source deactivate | ||
| - | **Step 9:** Creating a tabulated file to keep track of the findings. | + | |
| + | ==== Step 9: Creating a tabulated file to keep track of the findings. | ||
| + | |||
| NOTE: You will need a metadata file. It may consist of accession numbers and a fasta header. Please see the format of the metadata provided for this example ' | NOTE: You will need a metadata file. It may consist of accession numbers and a fasta header. Please see the format of the metadata provided for this example ' | ||
| | | ||
| Line 231: | Line 252: | ||
| Error: File existence/ | Error: File existence/ | ||
| - | **Step 10:** Move the pdf files and ' | + | ==== Step 10: Move the pdf files and ' |
| | | ||
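
The Step 1 sanity check (number of queries versus number of parsed blast results) can also be scripted, which makes it easier to see which query is missing when the counts differ. The following is only a minimal sketch and not part of the original workflow scripts. It assumes that the parsed blast table is tab-separated with one header line, that its first column holds the query identifier, and that this identifier equals the first whitespace-delimited token of the fasta header. The function names are hypothetical.

```python
import csv


def fasta_ids(path):
    """Return the first whitespace-delimited token of every fasta header line."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                ids.append(tokens[0] if tokens else "")
    return ids


def parsed_query_ids(path, has_header=True):
    """Return the first column of a tab-separated table (assumed layout)."""
    with open(path, newline="") as handle:
        # Blank lines come back as empty lists and are dropped here.
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if has_header:
        rows = rows[1:]  # skip the header line mentioned in Step 1
    return [row[0] for row in rows]


def missing_queries(fasta_path, table_path):
    """List query IDs found in the fasta file but absent from the parsed table."""
    found = set(parsed_query_ids(table_path))
    return [qid for qid in fasta_ids(fasta_path) if qid not in found]
```

With the file names of this example, calling `missing_queries("RNAreduced.seqs", "RNA.reduced.blastout_pparsed.tab")` would return an empty list when every query has a parsed blast result. Any identifiers it returns are the queries to check against the input format and the PANTHER website.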
ortologs_searches_using_panther_hmmrs.1682617595.txt.gz · Last modified: by 134.190.232.186
