This is an old revision of the document!
Table of Contents
Finding genes of interest with PANTHER HMMs
By Joran Martijn (Sep 2023)
The PANTHER database
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. The goal of PANTHER is to classify proteins by function.
The PANTHER initiative started in 1998 and was the first online database that included protein phylogenetic trees. Initially, very few species had whole genome sequences available but starting with version 6 (released in 2006), many more species were added. PANTHER16 (or 17?) included 142 fully sequenced genomes, including 19 vertebrates, 15 invertebrates, 14 fungi, 40 plants, 11 other eukaryotes, 8 archaea and 35 bacteria. The list is thus fairly limited and, more focused towards non-protist eukaryotes.
In 2002, PANTHER3 had 271779 proteins over 6155 families. In 2020, PANTHER16 had 2063337 proteins over 15635 families.
PANTHER families are defined as clusters of related proteins for which a good multiple sequence alignment can be made (defined in their original 2003 paper ). The emphasis is thus not so much on trying to reconstruct exact, complete orthogroups, but rather groupings that are useful to predict protein function on new protein sequences.
A key aspect of the PANTHER database is the division of families into multiple subfamilies. PANTHER's focus is on function and to a lesser degree evolutionary history, hence they attempt to classify proteins with distinct functions into distinct subfamilies. The idea is thus that each different subfamily has a distinct function.
The splitting of families into subfamilies was initially I think mostly done manually by trained biologists. However, in the most recent versions, it seems to be a mostly automated process but with the results still curated by trained biologists. In the 2021 paper a subfamily is defined as follows
A subfamily roughly corresponds to a group of least diverged orthologs, in that most members of each subfamily are mutually least diverged orthologs of all other members of the subfamily
I find this definition a bit cryptic, but it becomes a bit clearer with the visual example that they provide in the 2021 paper.
Here, sequences from distinct subfamilies are given a different color (blue, red, brown, pink, green). Speciation events are marked with green circles, duplication events with orange circles. Triangles are collapsed nodes, and diamonds indicate the root nodes of distinct subfamilies.
What becomes immediately apparent is that subfamilies are not necessary monophyletic! For example, the blue subfamily are polyphyletic (or paraphyletic?).
This is because of a duplication event that happened in the bottom daughter of the global root node. Two branches flow from the duplication event: a shorter one and a longer one. In PANTHERs definition, the sequences emerging from the shorter one comprise the least diverged orthologs (orthologs relative to the DANRE/ORYLA lineage on the top of the figure), and thus they belong to the same subfamily as the DANRE/ORYLA proteins. The sequences emerging from the longer one are thus NOT the least diverged, and hence they belong to a distinct subfamily.
Thus, any time a duplication event happens, a new subfamily is spawned. The exception is when the duplication leads to in-paralogs, or technically more specifically, to within-genome paralogs. That is why the duplication to the ORYLA and ORNAN copies did not spawn new subfamilies. The implication here is thus that after a duplication, one of the copies evolves to assume a new functional role.
Each family and subfamily in the database has associated with it a MSA, a tree and an HMM. The HMMs are useful to detect orthologs of these families and subfamilies in your pet genome. It is also useful to compare the E-values / bit-scores between families and subfamilies. If a novel predicted protein from your pet genome scores more highly against a family than a subfamily, it may indicate that your protein represents a novel subfamily.
PANTHER on Perun
Currently, the PANTHER17 database is stored on Perun here /scratch3/rogerlab_databases/other_dbs/PANTHER17.0/. It contains two subdirectories, books/ and globals/. The globals directory contains the actual database of HMMs, binHmm.* files, and names.tab, a file containing the actual annotation of each family and subfamily. Let's take a closer look:
PTHR10000.mag.mod PHOSPHOSERINE PHOSPHATASE PTHR10000.SF55.mod 5-AMINO-6-(5-PHOSPHO-D-RIBITYLAMINO)URACIL PHOSPHATASE YCSE PTHR10000.SF23.mod 5-AMINO-6-(5-PHOSPHO-D-RIBITYLAMINO)URACIL PHOSPHATASE YITU PTHR10000.SF57.mod KANOSAMINE-6-PHOSPHATE PHOSPHATASE ... ... PTHR10000.SF56.mod MANNOSYL-3-PHOSPHOGLYCERATE PHOSPHATASE PTHR10003.mag.mod SUPEROXIDE DISMUTASE CU-ZN -RELATED PTHR10003.SF95.mod SUPEROXIDE DISMUTASE [CU-ZN] 2, CHLOROPLASTIC PTHR10003.SF82.mod SOD_CU DOMAIN-CONTAINING PROTEIN PTHR10003.SF94.mod EXTRACELLULAR SUPEROXIDE DISMUTASE [CU-ZN] 3 PTHR10003.SF31.mod SUPEROXIDE DISMUTASE [CU-ZN] 3 ... ...
Those ending with .mag.mod are families, and those ending with .SF[number].mod are their respective subfamilies.
The Light_panther17.tsv file is a simple list of all families.
Searching your genome for PANTHER families homologs
Do an hmmscan
Simply do an hmmscan from the HMMER3 package. hmmscan takes as a query a single or a list of protein sequences and compares them with a database of HMM profiles. Below, we are taking the entire PANTHER17 database of HMMs as a database.
hmmscan \ -E 0.001 --incE 0.0001 \ --domE 0.01 --incdomE 0.01 \ --notextw --cpu 10 \ --tblout ergo_vs_panther17.tblout --domtblout ergo_vs_panther17.domtblout binHmm Ergobibamus_cyprinoides_CL.proteins.fa
This will yield two outfiles, the ergo_vs_panther17.tblout and ergo_vs_panther17.domtblout.
The .tblout file is a tabular output file where each line has a unique query (a protein sequence) and target (the HMM profile) combination. The overall E-value and score between HMM and sequence are reported (“full sequence”), as well as the segment of the protein that scored the highest with said HMM (“best 1 domain”), as well as a bunch of metrics for estimating the number of “domains”.
# --- full sequence ---- --- best 1 domain ---- --- domain number estimation ---- # target name accession query name accession E-value score bias E-value score bias exp reg clu ov env dom rep inc description of target # ------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- --------------------- PTHR31752.SF18.pir - FUN_000001-T1 - 6.8e-23 83.9 5.0 3.3e-12 48.7 2.4 2.1 1 1 1 2 2 2 2 - PTHR36838.orig.30.pir - FUN_000001-T1 - 5.6e-22 81.0 13.0 7e-22 80.7 1.8 2.2 2 0 0 2 2 2 2 - PTHR36838.SF1.pir - FUN_000001-T1 - 1.9e-19 72.8 4.4 1.9e-19 72.8 4.4 2.4 2 0 0 2 2 1 1 - PTHR31752.orig.30.pir - FUN_000001-T1 - 3.3e-16 62.0 4.6 2.6e-14 55.7 4.6 2.2 1 1 0 1 1 1 1 - PTHR36838.SF3.pir - FUN_000001-T1 - 1.4e-12 50.3 3.7 2.6e-12 49.3 0.1 2.2 2 0 0 2 2 1 1 - PTHR31752.SF40.pir - FUN_000001-T1 - 1.2e-11 47.3 0.0 2e-11 46.6 0.0 1.3 1 0 0 1 1 1 1 - PTHR31752.SF2.pir - FUN_000001-T1 - 5.5e-09 38.3 0.0 7e-09 38.0 0.0 1.2 1 0 0 1 1 1 1 - PTHR31752.SF49.pir - FUN_000001-T1 - 3.4e-08 35.9 0.0 5.9e-08 35.2 0.0 1.4 1 0 0 1 1 1 1 - ...
NOTE that the outfile uses the word “domain” because traditionally HMM profiles describe protein domains. The Pfam is a good example of a database of domain HMM profiles. PANTHER HMM describes entire proteins. Hence the term 'domain' is not very well suited here, and can be a bit confusing.
NOTE also that the names of the families are slightly different here. Instead of ending with .mag.mod or .SF[number].mod, they end with .orig.30.pir and .SF[number].pir.
A protein sequence may have multiple segments that each match well with a particular HMM profile independently. To see each of these alignments separately, we can inspect the .domtblout file:
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord # target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target # ------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- --------------------- PTHR31752.SF18.pir - 418 FUN_000001-T1 - 412 6.8e-23 83.9 5.0 1 2 1.9e-16 3.3e-12 48.7 2.4 9 167 10 174 4 201 0.78 - PTHR31752.SF18.pir - 418 FUN_000001-T1 - 412 6.8e-23 83.9 5.0 2 2 1.4e-12 2.4e-08 35.9 0.0 261 369 210 350 182 404 0.72 - PTHR36838.orig.30.pir - 308 FUN_000001-T1 - 412 5.6e-22 81.0 13.0 1 2 4e-26 7e-22 80.7 1.8 10 214 15 272 9 281 0.82 - PTHR36838.orig.30.pir - 308 FUN_000001-T1 - 412 5.6e-22 81.0 13.0 2 2 0.0028 50 5.6 3.8 222 276 310 367 301 398 0.80 - PTHR36838.SF1.pir - 308 FUN_000001-T1 - 412 1.9e-19 72.8 4.4 1 1 1.1e-23 1.9e-19 72.8 4.4 9 211 14 269 8 276 0.82 - PTHR31752.orig.30.pir - 405 FUN_000001-T1 - 412 3.3e-16 62.0 4.6 1 1 1.4e-18 2.6e-14 55.7 4.6 10 340 10 330 3 402 0.73 - PTHR36838.SF3.pir - 312 FUN_000001-T1 - 412 1.4e-12 50.3 3.7 1 1 1.5e-16 2.6e-12 49.3 0.1 10 216 16 272 10 279 0.77 - PTHR31752.SF40.pir - 360 FUN_000001-T1 - 412 1.2e-11 47.3 0.0 1 1 1.1e-15 2e-11 46.6 0.0 37 267 36 274 8 282 0.75 - PTHR31752.SF2.pir - 349 FUN_000001-T1 - 412 5.5e-09 38.3 0.0 1 1 4e-13 7e-09 38.0 0.0 103 284 105 332 60 400 0.70 - ...
Now, each line represents a unique “domain” (i.e. segment of a protein sequence) and HMM profile combination. The overall E-value and score between HMM and the full sequence is still reported, as well as the scores of the specific “domains” that were found for each sequence.
Add HMM profile coverage information
You may want to reduce the number of false positives (i.e. protein sequences that match with an HMM but are not really orthologs) by discarding queries that only have partial alignments with the HMM. With BLAST I like to use the fields qcovhsp, qcovs, which describe how much of the query was covered by the HSP (High scoring Segment Pair, or in simpler terms, the alignment between query and hit) (qcovhsp) or collection of alignments between the same query and hit (qcovs). For example, if an alignment covered less than, say 60% of the length of the query, I may consider it a false positive hit. Unfortunately, HMMER does not directly report such metrics, but it does report the metrics necessary to calculate them. I wrote a python script, extend_hmmer_domtblout.py that does exactly this:
extend_hmmer_domtblout.py -m hmmscan -i ergo_vs_panther17.domtblout -o ergo_vs_panther17.ext.domtblout
This yields a file that looks like this
TargetName TargetAcc TargetLen QueryName QueryAcc QueryLen seq-E-value seq-Score seq-Bias dom-No dom-Count dom-c-Evalue dom-i-Evalue dom-Score dom-Bias hmm-From hmm-To ali-From ali-To env-From env-To mean-PP DescriptionOfTarget qcovhsp tcovhsp qcovs tcovs PTHR31752.SF18.pir - 418 FUN_000001-T1 - 412 6.8e-23 83.9 5.0 1 2 1.9e-16 3.3e-12 48.7 2.4 9 167 10 174 4 201 0.78 - 0.40 0.38 0.74 0.64 PTHR31752.SF18.pir - 418 FUN_000001-T1 - 412 6.8e-23 83.9 5.0 2 2 1.4e-12 2.4e-08 35.9 0.0 261 369 210 350 182 404 0.72 - 0.34 0.26 0.74 0.64 PTHR36838.orig.30.pir - 308 FUN_000001-T1 - 412 5.6e-22 81.0 13.0 1 2 4e-26 7e-22 80.7 1.8 10 214 15 272 9 281 0.82 - 0.63 0.67 0.77 0.84 PTHR36838.orig.30.pir - 308 FUN_000001-T1 - 412 5.6e-22 81.0 13.0 2 2 0.0028 50.0 5.6 3.8 222 276 310 367 301 398 0.8 - 0.14 0.18 0.77 0.84 PTHR36838.SF1.pir - 308 FUN_000001-T1 - 412 1.9e-19 72.8 4.4 1 1 1.1e-23 1.9e-19 72.8 4.4 9 211 14 269 8 276 0.82 - 0.62 0.66 0.62 0.66 PTHR31752.orig.30.pir - 405 FUN_000001-T1 - 412 3.3e-16 62.0 4.6 1 1 1.4e-18 2.6e-14 55.7 4.6 10 340 10 330 3 402 0.73 - 0.78 0.82 0.78 0.82 PTHR36838.SF3.pir - 312 FUN_000001-T1 - 412 1.4e-12 50.3 3.7 1 1 1.5e-16 2.6e-12 49.3 0.1 10 216 16 272 10 279 0.77 - 0.62 0.66 0.62 0.66 PTHR31752.SF40.pir - 360 FUN_000001-T1 - 412 1.2e-11 47.3 0.0 1 1 1.1e-15 2e-11 46.6 0.0 37 267 36 274 8 282 0.75 - 0.58 0.64 0.58 0.64 PTHR31752.SF2.pir - 349 FUN_000001-T1 - 412 5.5e-09 38.3 0.0 1 1 4e-13 7e-09 38.0 0.0 103 284 105 332 60 400 0.7 - 0.55 0.52 0.55 0.52 PTHR31752.SF49.pir - 358 FUN_000001-T1 - 412 3.4e-08 35.9 0.0 1 1 3.3e-12 5.9e-08 35.2 0.0 39 309 38 319 6 370 0.72 - 0.68 0.76 0.68 0.76 PTHR31913.orig.30.pir - 655 FUN_000003-T1 - 778 1e-54 189.4 0.0 1 1 7.2e-59 2e-54 188.4 0.0 203 649 350 776 296 778 0.83 - 0.55 0.68 0.55 0.68 PTHR31913.SF0.pir - 659 FUN_000003-T1 - 778 8e-54 186.4 0.0 1 1 5.1e-58 1.4e-53 185.6 0.0 313 653 441 776 332 778 0.88 - 0.43 0.52 0.43 0.52 PTHR31913.SF7.pir - 619 FUN_000003-T1 - 778 7.6e-15 57.4 0.0 1 1 4.9e-19 1.4e-14 56.6 0.0 325 551 516 730 506 739 0.85 - 0.28 0.37 0.28 0.37 ...
The script does a few things. It reformats the header line, and it makes the whole table tab delimited, which makes it easier to parse later on. Most importantly, it adds 4 new columns, qcovhsp, tcovhsp, qcovs and tcovs.
qcovhsp how much of the query (the protein sequence) is covered by the HMM:protein alignment tcovhsp how much of the target (the HMM profile) is covered by the HMM:protein alignment qcovs how much of the query (the protein sequence) is covered _in_total_ by all HMM:protein alignments between this HMM and this protein tcovs how much of the target (the HMM profile) is covered _in_total_ by all HMM:protein alignments between this HMM and this protein
Add PANTHER annotation
Knowing which families and subfamilies had hits in your genome is nice and all, but it's just a bunch of unintelligible identifiers. I wrote another script that links these identifiers with their actual annotations (this linking information is found in names.tab and adds the actual annotations to the .domtblout file
add_pthr_annotation_to_hmmer_out.py -i ergo_vs_panther17.ext.domtblout -n names.tab > ergo_vs_panther17.annot.ext.domtblout
The output file will look something like this
TargetName TargetAcc TargetLen QueryName QueryAcc QueryLen seq-E-value seq-Score seq-Bias dom-No dom-Count dom-c-Evalue dom-i-Evalue dom-Score dom-Bias hmm-From hmm-To ali-From ali-To env-From env-To mean-PP DescriptionOfTarget qcovhsp tcovhsp qcovs tcovs annotation PTHR31752.SF18.pir - 418 FUN_000001-T1 - 412 6.8e-23 83.9 5.0 1 2 1.9e-16 3.3e-12 48.7 2.4 9 167 10 174 4 201 0.78 - 0.40 0.38 0.74 0.64 AUXIN_EFFLUX_CARRIER_COMPONENT_3A PTHR31752.SF18.pir - 418 FUN_000001-T1 - 412 6.8e-23 83.9 5.0 2 2 1.4e-12 2.4e-08 35.9 0.0 261 369 210 350 182 404 0.72 - 0.34 0.26 0.74 0.64 AUXIN_EFFLUX_CARRIER_COMPONENT_3A PTHR36838.orig.30.pir - 308 FUN_000001-T1 - 412 5.6e-22 81.0 13.0 1 2 4e-26 7e-22 80.7 1.8 10 214 15 272 9 281 0.82 - 0.63 0.67 0.77 0.84 AUXIN_EFFLUX_CARRIER_FAMILY_PROTEIN PTHR36838.orig.30.pir - 308 FUN_000001-T1 - 412 5.6e-22 81.0 13.0 2 2 0.0028 50.0 5.6 3.8 222 276 310 367 301 398 0.8 - 0.14 0.18 0.77 0.84 AUXIN_EFFLUX_CARRIER_FAMILY_PROTEIN PTHR36838.SF1.pir - 308 FUN_000001-T1 - 412 1.9e-19 72.8 4.4 1 1 1.1e-23 1.9e-19 72.8 4.4 9 211 14 269 8 276 0.82 - 0.62 0.66 0.62 0.66 SLR1864_PROTEIN PTHR31752.orig.30.pir - 405 FUN_000001-T1 - 412 3.3e-16 62.0 4.6 1 1 1.4e-18 2.6e-14 55.7 4.6 10 340 10 330 3 402 0.73 - 0.78 0.82 0.78 0.82 AUXIN_EFFLUX_CARRIER_COMPONENT_1B-RELATED PTHR36838.SF3.pir - 312 FUN_000001-T1 - 412 1.4e-12 50.3 3.7 1 1 1.5e-16 2.6e-12 49.3 0.1 10 216 16 272 10 279 0.77 - 0.62 0.66 0.62 0.66 AUXIN_EFFLUX_CARRIER_FAMILY_PROTEIN PTHR31752.SF40.pir - 360 FUN_000001-T1 - 412 1.2e-11 47.3 0.0 1 1 1.1e-15 2e-11 46.6 0.0 37 267 36 274 8 282 0.75 - 0.58 0.64 0.58 0.64 AUXIN_EFFLUX_CARRIER_COMPONENT_8 PTHR31752.SF2.pir - 349 FUN_000001-T1 - 412 5.5e-09 38.3 0.0 1 1 4e-13 7e-09 38.0 0.0 103 284 105 332 60 400 0.7 - 0.55 0.52 0.55 0.52 AUXIN_EFFLUX_CARRIER_COMPONENT_5 PTHR31752.SF49.pir - 358 FUN_000001-T1 - 412 3.4e-08 35.9 0.0 1 1 3.3e-12 5.9e-08 35.2 0.0 39 309 38 319 6 370 0.72 - 0.68 0.76 0.68 0.76 AUXIN_EFFLUX_CARRIER_COMPONENT PTHR31913.orig.30.pir - 655 FUN_000003-T1 - 778 1e-54 189.4 0.0 1 1 7.2e-59 2e-54 188.4 0.0 203 649 350 776 296 778 0.83 - 0.55 0.68 0.55 0.68 VACUOLAR_IMPORT_AND_DEGRADATION_PROTEIN_27 PTHR31913.SF0.pir - 659 FUN_000003-T1 - 778 8e-54 186.4 0.0 1 1 5.1e-58 1.4e-53 185.6 0.0 313 653 441 776 332 778 0.88 - 0.43 0.52 0.43 0.52 VACUOLAR_IMPORT_AND_DEGRADATION_PROTEIN_27 PTHR31913.SF7.pir - 619 FUN_000003-T1 - 778 7.6e-15 57.4 0.0 1 1 4.9e-19 1.4e-14 56.6 0.0 325 551 516 730 506 739 0.85 - 0.28 0.37 0.28 0.37 DEM_PROTEIN PTHR31913.SF5.pir - 613 FUN_000003-T1 - 778 2.3e-12 49.0 0.1 1 1 5.5e-16 1.5e-11 46.2 0.1 231 539 442 731 434 740 0.78 - 0.37 0.50 0.37 0.50 VID27_DOMAIN-CONTAINING_PROTEIN PTHR31913.SF2.pir - 622 FUN_000003-T1 - 778 4.1e-08 35.1 0.0 1 1 1.4e-11 3.9e-07 31.9 0.0 325 553 516 729 506 738 0.77 - 0.28 0.37 0.28 0.37 OS02G0789600_PROTEIN PTHR23359.SF250.pir - 270 FUN_000004-T1 - 520 7.6e-74 251.2 0.0 1 2 3.2e-74 7.2e-71 241.4 0.0 59 265 48 262 21 267 0.93 - 0.41 0.77 0.45 0.77 KINASE,_PUTATIVE-RELATED ...
Filter by HMM profile coverage and composition bias
You may only want to keep proteins of your pet genome that are (nearly) fully covered by PANTHER (sub)families to reduce the number of false positives.
I did this by only keeping lines of the .domtblout where the tcovs (the coverage of the target accumulated over all protein:HMM-profile alignments; the 27th column) is at least 50% of the HMM-profile length, and the seq-Bias score (9th column) is less than 90% of the total bit seq-Score (8th column).
After that, the table is sorted by protein sequence identifier and only the best hit, according to seq-Score remains.
function body {
IFS= read -r header
printf '%s\n' "$header"
"$@"
}
cat ergo_vs_panther17.annot.ext.domtblout \
| body awk '$27>=0.50 && $9/$8<0.9' \
| body sort -k4,4 -k8,8nr \
| body sort -k4,4 -u \
> ergo_vs_panther17.filt.ext.domtblout
The body function is a nifty trick to only apply your awk or sort operations on the body of the table and leave the header intact.
Our new file looks as follows:
TargetName TargetAcc TargetLen QueryName QueryAcc QueryLen seq-E-value seq-Score seq-Bias dom-No dom-Count dom-c-Evalue dom-i-Evalue dom-Score dom-Bias hmm-From hmm-To ali-From ali-To env-From env-To mean-PP DescriptionOfTarget qcovhsp tcovhsp qcovs tcovs annotation PTHR31752.SF18.pir - 418 FUN_000001-T1 - 412 6.8e-23 83.9 5.0 1 2 1.9e-16 3.3e-12 48.7 2.4 9 167 10 174 4 201 0.78 - 0.40 0.38 0.74 0.64 AUXIN_EFFLUX_CARRIER_COMPONENT_3A PTHR31913.orig.30.pir - 655 FUN_000003-T1 - 778 1e-54 189.4 0.0 1 1 7.2e-59 2e-54 188.4 0.0 203 649 350 776 296 778 0.83 - 0.55 0.68 0.55 0.68 VACUOLAR_IMPORT_AND_DEGRADATION_PROTEIN_27 PTHR23359.SF250.pir - 270 FUN_000004-T1 - 520 7.6e-74 251.2 0.0 1 2 3.2e-74 7.2e-71 241.4 0.0 59 265 48 262 21 267 0.93 - 0.41 0.77 0.45 0.77 KINASE,_PUTATIVE-RELATED PTHR45991.orig.30.pir - 450 FUN_000005-T1 - 544 9.4e-99 334.2 0.0 1 1 2e-101 1.3e-98 333.8 0.0 32 438 18 505 11 520 0.86 - 0.90 0.90 0.90 0.90 PACHYTENE_CHECKPOINT_PROTEIN_2 PTHR13563.orig.30.pir - 346 FUN_000006-T1 - 343 8.1e-52 179.5 0.0 1 1 3.6e-56 1e-51 179.1 0.0 63 321 43 320 20 343 0.79 - 0.81 0.75 0.81 0.75 TRNA_(GUANINE-9-)_METHYLTRANSFERASE PTHR11685.SF436.pir - 285 FUN_000007-T1 - 589 2.4e-19 72.8 8.5 1 1 5.8e-23 2.6e-19 72.6 8.5 76 240 241 491 132 578 0.7 - 0.43 0.58 0.43 0.58 RBR-TYPE_E3_UBIQUITIN_TRANSFERASE PTHR46233.orig.30.pir - 231 FUN_000011-T1 - 208 1.7e-48 167.7 0.0 1 1 5.2e-52 2.1e-48 167.4 0.0 20 214 11 206 3 207 0.94 - 0.94 0.84 0.94 0.84 HYDROXYACYLGLUTATHIONE_HYDROLASE_GLOC PTHR23102.SF19.pir - 206 FUN_000013-T1 - 233 1e-31 113.1 9.9 1 1 1.1e-35 1.3e-31 112.7 9.9 58 172 38 151 31 186 0.9 - 0.49 0.56 0.49 0.56 CLEAVAGE_AND_POLYADENYLATION_SPECIFICITY_FACTOR_SUBUNIT_4-LIKE_PROTEIN-RELATED PTHR47979.SF50.pir - 216 FUN_000014-T1 - 206 1.6e-105 353.9 0.0 1 1 8.5e-108 1.9e-105 353.7 0.0 1 193 1 193 1 204 0.98 - 0.94 0.89 0.94 0.89 RAS-RELATED_PROTEIN_RAB-2B PTHR17605.orig.30.pir - 732 FUN_000016-T1 - 712 4.8e-175 586.8 0.0 1 1 1e-177 6.9e-175 586.3 0.0 151 729 100 710 92 712 0.92 - 0.86 0.79 0.86 0.79 RIBOSOME_BIOGENESIS_PROTEIN_BOP1__BLOCK_OF_PROLIFERATION_1_PROTEIN ...
Each of the protein queries in this list has a significant hit against a PANTHER family, and all the alignments between this protein and the HMM-profile cumulatively cover at least 50% of the HMM-profile.
Selecting (sub)families of interest
Until now we have considered all predicted proteins of our pet genome. But perhaps we are only particularly interested for whatever reason in a subset of them.
For example, we want to know what kind of energy conservation genes are present in our genome. From studying the literature, I've compiled in families_of_interest.ATP.upd.list the following list of PANTHER families that seem to be involved in ATP synthesis. The table is formatted like “category long name”, “category short name”, “gene or protein name”, “PANTHER family identifier (names.tab)”, “PANTHER family annotation”, “PANTHER family identifier (binHmm database)”.
ENERGY_CONSERVATION ATP ME PTHR23406.mag.mod MALIC_ENZYME-RELATED PTHR23406.orig.30.pir ENERGY_CONSERVATION ATP ME PTHR23406.SF32.mod NADP-DEPENDENT_MALIC_ENZYME PTHR23406.SF32.pir ENERGY_CONSERVATION ATP PNO/PFO PTHR32154.mag.mod PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED PTHR32154.orig.30.pir ENERGY_CONSERVATION ATP PNO/PFO PTHR32154.SF0.mod PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED PTHR32154.SF0.pir ENERGY_CONSERVATION ATP Adrenodoxin/Ferredoxin_Yah1 PTHR23426.mag.mod FERREDOXIN/ADRENODOXIN PTHR23426.orig.30.pir ENERGY_CONSERVATION ATP Adrenodoxin/Ferredoxin_Yah1 PTHR23426.SF65.mod FERREDOXIN_1-RELATED PTHR23426.SF65.pir ENERGY_CONSERVATION ATP FdxR/Arh1 PTHR11938.mag.mod FAD_NADPH_DEHYDROGENASE/OXIDOREDUCTASE PTHR11938.orig.30.pir ENERGY_CONSERVATION ATP FdxR/Arh1 PTHR11938.SF91.mod NADPH:ADRENODOXIN_OXIDOREDUCTASE,_MITOCHONDRIAL PTHR11938.SF91.pir ENERGY_CONSERVATION ATP HydA PTHR11615.mag.mod NITRATE,_FORMATE,_IRON_DEHYDROGENASE PTHR11615.orig.30.pir ENERGY_CONSERVATION ATP HydA-1 PTHR11615.SF345.mod CATALYTIC_SUBUNIT_OF_IRON-ONLY_HYDROGENASE-RELATED PTHR11615.SF345.pir ENERGY_CONSERVATION ATP HydA-4 PTHR11615.SF304.mod IRON_HYDROGENASE,_PUTATIVE-RELATED PTHR11615.SF304.pir ENERGY_CONSERVATION ATP HydE PTHR43726.mag.mod 3-METHYLORNITHINE_SYNTHASE PTHR43726.orig.30.pir ENERGY_CONSERVATION ATP HydE PTHR43726.SF1.mod [FEFE]_HYDROGENASE_MATURASE_SUBUNIT_HYDE PTHR43726.SF1.pir ENERGY_CONSERVATION ATP HydF PTHR42714.mag.mod TRNA_MODIFICATION_GTPASE_GTPBP3 PTHR42714.orig.30.pir ENERGY_CONSERVATION ATP HydF PTHR42714.SF6.mod TRANSLATION_INITIATION_FACTOR_IF-2 PTHR42714.SF6.pir ENERGY_CONSERVATION ATP HydG PTHR43583.mag.mod 2-IMINOACETATE_SYNTHASE PTHR43583.orig.30.pir ENERGY_CONSERVATION ATP HydG PTHR43583.SF2.mod BIOTIN_AND_THIAMIN_SYNTHESIS_ASSOCIATED_DOMAIN_CONTAINING_PROTEIN PTHR43583.SF2.pir ENERGY_CONSERVATION ATP NuoF/Complex_I_51kDa PTHR11780.mag.mod NADH-UBIQUINONE_OXIDOREDUCTASE_FLAVOPROTEIN_1_NDUFV1 PTHR11780.orig.30.pir ENERGY_CONSERVATION ATP NuoF/Complex_I_51kDa PTHR11780.SF10.mod NADH_DEHYDROGENASE_[UBIQUINONE]_FLAVOPROTEIN_1,_MITOCHONDRIAL PTHR11780.SF10.pir ENERGY_CONSERVATION ATP NuoE/Complex_I_24kDa PTHR10371.mag.mod NADH_DEHYDROGENASE_UBIQUINONE_FLAVOPROTEIN_2,_MITOCHONDRIAL PTHR10371.orig.30.pir ENERGY_CONSERVATION ATP NuoE/Complex_I_24kDa PTHR10371.SF3.mod NADH_DEHYDROGENASE_[UBIQUINONE]_FLAVOPROTEIN_2,_MITOCHONDRIAL PTHR10371.SF3.pir ENERGY_CONSERVATION ATP ASCT-1B PTHR21432.mag.mod ACETYL-COA_HYDROLASE-RELATED PTHR21432.orig.30.pir ENERGY_CONSERVATION ATP ASCT-1B PTHR21432.SF20.mod ACETYL-COA_DEACYLASE PTHR21432.SF20.pir ENERGY_CONSERVATION ATP ASCT-1C PTHR43609.mag.mod ACETYL-COA_HYDROLASE PTHR43609.orig.30.pir ENERGY_CONSERVATION ATP ASCT-1C PTHR43609.SF1.mod ACETYL-COA_HYDROLASE PTHR43609.SF1.pir ENERGY_CONSERVATION ATP SCSa/STKa PTHR11117.mag.mod SUCCINYL-COA_LIGASE_SUBUNIT_ALPHA PTHR11117.orig.30.pir ENERGY_CONSERVATION ATP SCSa/STKa PTHR11117.SF2.mod SUCCINATE--COA_LIGASE_[ADP/GDP-FORMING]_SUBUNIT_ALPHA,_MITOCHONDRIAL PTHR11117.SF2.pir ENERGY_CONSERVATION ATP SCSb/STKb PTHR11815.mag.mod SUCCINYL-COA_SYNTHETASE_BETA_CHAIN PTHR11815.orig.30.pir ENERGY_CONSERVATION ATP SCSb/STKb PTHR11815.SF10.mod SUCCINATE--COA_LIGASE_[GDP-FORMING]_SUBUNIT_BETA,_MITOCHONDRIAL PTHR11815.SF10.pir ENERGY_CONSERVATION ATP DLDH PTHR43716.mag.mod D-2-HYDROXYGLUTARATE_DEHYDROGENASE,_MITOCHONDRIAL PTHR43716.orig.30.pir ENERGY_CONSERVATION ATP DLDH PTHR43716.SF1.mod D-2-HYDROXYGLUTARATE_DEHYDROGENASE,_MITOCHONDRIAL PTHR43716.SF1.pir ENERGY_CONSERVATION ATP CysJ-fused_HydA PTHR19384.mag.mod NITRIC_OXIDE_SYNTHASE-RELATED PTHR19384.orig.30.pir ENERGY_CONSERVATION ATP CysJ-fused_HydA PTHR19384.SF109.mod SULFITE_REDUCTASE_[NADPH]_FLAVOPROTEIN_COMPONENT PTHR19384.SF109.pir ENERGY_CONSERVATION ATP CysJ-fused_HydA PTHR19384.SF17.mod NADPH--CYTOCHROME_P450_REDUCTASE PTHR19384.SF17.pir ENERGY_CONSERVATION ATP PFL PTHR30191.mag.mod FORMATE_ACETYLTRANSFERASE PTHR30191.orig.30.pir ENERGY_CONSERVATION ATP PFL PTHR30191.SF0.mod FORMATE_ACETYLTRANSFERASE_1 PTHR30191.SF0.pir ENERGY_CONSERVATION ATP ACS PTHR42793.mag.mod COA_BINDING_DOMAIN_CONTAINING_PROTEIN PTHR42793.orig.30.pir ENERGY_CONSERVATION ATP ACS PTHR42793.SF1.mod PEPTIDYL-LYSINE_N-ACETYLTRANSFERASE_PATZ PTHR42793.SF1.pir ENERGY_CONSERVATION ATP ACS PTHR43334.mag.mod ACETATE--COA_LIGASE_[ADP-FORMING] PTHR43334.orig.30.pir ENERGY_CONSERVATION ATP ACS PTHR43334.SF1.mod ACETATE--COA_LIGASE_[ADP-FORMING] PTHR43334.SF1.pir ENERGY_CONSERVATION ATP MM-CoA-epimerase PTHR43048.mag.mod METHYLMALONYL-COA_EPIMERASE PTHR43048.orig.30.pir ENERGY_CONSERVATION ATP MM-CoA-epimerase PTHR43048.SF3.mod METHYLMALONYL-COA_EPIMERASE,_MITOCHONDRIAL PTHR43048.SF3.pir ENERGY_CONSERVATION ATP MM-CoA-mutase PTHR48101.SF4.mod METHYLMALONYL-COA_MUTASE,_MITOCHONDRIAL PTHR48101.SF4.pir ENERGY_CONSERVATION ATP MM-CoA-mutase PTHR48101.mag.mod METHYLMALONYL-COA_MUTASE,_MITOCHONDRIAL-RELATED PTHR48101.orig.30.pir ENERGY_CONSERVATION ATP MMSA-dehydrogenase PTHR43866.mag.mod MALONATE-SEMIALDEHYDE_DEHYDROGENASE PTHR43866.orig.30.pir ENERGY_CONSERVATION ATP MMSA-dehydrogenase PTHR43866.SF3.mod METHYLMALONATE-SEMIALDEHYDE_DEHYDROGENASE_[ACYLATING],_MITOCHONDRIAL PTHR43866.SF3.pir ENERGY_CONSERVATION ATP PCCa PTHR18866.mag.mod CARBOXYLASE:PYRUVATE/ACETYL-COA/PROPIONYL-COA_CARBOXYLASE PTHR18866.orig.30.pir ENERGY_CONSERVATION ATP PCCa PTHR18866.SF33.mod METHYLCROTONOYL-COA_CARBOXYLASE_SUBUNIT_ALPHA,_MITOCHONDRIAL-RELATED PTHR18866.SF33.pir ENERGY_CONSERVATION ATP PCCb PTHR43842.mag.mod PROPIONYL-COA_CARBOXYLASE_BETA_CHAIN PTHR43842.orig.30.pir ENERGY_CONSERVATION ATP PCCb PTHR43842.SF2.mod PROPIONYL-COA_CARBOXYLASE_BETA_CHAIN-RELATED PTHR43842.SF2.pir ENERGY_CONSERVATION ATP PCCb PTHR45728.SF3.mod ACETYL-COA_CARBOXYLASE PTHR45728.SF3.pir ENERGY_CONSERVATION ATP PCCb PTHR45728.mag.mod ACETYL-COA_CARBOXYLASE,_ISOFORM_A PTHR45728.orig.30.pir ENERGY_CONSERVATION ATP 3HB-dehydrogenase PTHR43060.mag.mod 3-HYDROXYISOBUTYRATE_DEHYDROGENASE-LIKE_1,_MITOCHONDRIAL-RELATED PTHR43060.orig.30.pir ENERGY_CONSERVATION ATP 3HB-dehydrogenase PTHR43060.SF15.mod 3-HYDROXYISOBUTYRATE_DEHYDROGENASE-LIKE_1,_MITOCHONDRIAL-RELATED PTHR43060.SF15.pir
NOTE that I've included for every gene the family and the subfamily identifier. So, which of these are in our genome?
cat ergo_vs_panther17.filt.ext.domtblout \
| body rg -f <(cut -f 6 families_of_interest.ATP.upd.list) \
| body sort -k28,28 \
> ergo_vs_panther17.ATP.ext.domtblout
rg is a modern unix tool that is very similar to the classic grep (it is literally called ripgrep) but its a lot faster, particularly with the -f option.
We now get the following file
TargetName TargetAcc TargetLen QueryName QueryAcc QueryLen seq-E-value seq-Score seq-Bias dom-No dom-Count dom-c-Evalue dom-i-Evalue dom-Score dom-Bias hmm-From hmm-To ali-From ali-To env-From env-To mean-PP DescriptionOfTarget qcovhsp tcovhsp qcovs tcovs annotation PTHR43726.orig.30.pir - 384 FUN_005298-T1 - 390 7.1e-100 337.2 0.0 1 1 8.8e-104 8.9e-100 336.9 0.0 45 380 18 380 5 383 0.94 - 0.93 0.88 0.93 0.88 3-METHYLORNITHINE_SYNTHASE PTHR43334.orig.30.pir - 705 FUN_004883-T1 - 761 1.1e-247 826.9 0.0 1 1 1.3e-251 1.4e-247 826.6 0.0 4 702 31 733 29 736 0.98 - 0.92 0.99 0.92 0.99 ACETATE--COA_LIGASE_[ADP-FORMING] PTHR43334.orig.30.pir - 705 FUN_004897-T1 - 647 1.4e-224 750.4 0.0 1 1 1.7e-228 1.8e-224 750.1 0.0 1 630 12 643 12 646 0.98 - 0.98 0.89 0.98 0.89 ACETATE--COA_LIGASE_[ADP-FORMING] PTHR43583.SF2.pir - 498 FUN_005007-T1 - 570 6e-189 631.4 0.0 1 1 5.7e-193 7.3e-189 631.1 0.0 19 498 58 570 31 570 0.97 - 0.90 0.96 0.90 0.96 BIOTIN_AND_THIAMIN_SYNTHESIS_ASSOCIATED_DOMAIN_CONTAINING_PROTEIN PTHR43583.SF2.pir - 498 FUN_005031-T1 - 570 7.6e-189 631.1 0.0 1 1 7.2e-193 9.3e-189 630.8 0.0 19 498 58 570 31 570 0.97 - 0.90 0.96 0.90 0.96 BIOTIN_AND_THIAMIN_SYNTHESIS_ASSOCIATED_DOMAIN_CONTAINING_PROTEIN PTHR11615.SF345.pir - 585 FUN_003250-T1 - 583 5.5e-167 559.9 0.2 1 1 1.1e-170 6.3e-167 559.7 0.2 6 581 5 579 1 582 0.94 - 0.99 0.98 0.99 0.98 CATALYTIC_SUBUNIT_OF_IRON-ONLY_HYDROGENASE-RELATED PTHR11615.SF345.pir - 585 FUN_003742-T1 - 461 3.9e-127 428.3 1.4 1 1 5.8e-131 6.9e-127 427.5 1.4 139 580 31 460 4 461 0.92 - 0.93 0.76 0.93 0.76 CATALYTIC_SUBUNIT_OF_IRON-ONLY_HYDROGENASE-RELATED PTHR42793.orig.30.pir - 472 FUN_000469-T1 - 843 1.5e-104 353.0 0.0 1 1 5.9e-109 1.7e-104 352.9 0.0 5 460 352 842 146 843 0.93 - 0.58 0.97 0.58 0.97 COA_BINDING_DOMAIN_CONTAINING_PROTEIN PTHR23426.SF65.pir - 153 FUN_000689-T1 - 116 1.5e-13 54.3 0.0 1 1 1.7e-17 1.7e-13 54.1 0.0 57 142 28 109 14 115 0.87 - 0.71 0.56 0.71 0.56 FERREDOXIN_1-RELATED PTHR23426.SF65.pir - 153 FUN_001937-T1 - 106 2.1e-13 53.8 0.1 1 1 1.9e-17 2.4e-13 53.6 0.1 46 139 8 99 3 105 0.81 - 0.87 0.61 0.87 0.61 FERREDOXIN_1-RELATED PTHR30191.SF0.pir - 749 FUN_003588-T1 - 742 0.0 1182.9 0.1 1 1 0.0 0.0 1182.7 0.1 20 748 4 741 1 742 0.98 - 0.99 0.97 0.99 0.97 FORMATE_ACETYLTRANSFERASE_1 PTHR30191.SF0.pir - 749 FUN_004683-T1 - 779 0.0 1200.2 0.0 1 1 0.0 0.0 1199.9 0.0 15 748 36 778 26 779 0.98 - 0.95 0.98 0.95 0.98 FORMATE_ACETYLTRANSFERASE_1 PTHR30191.SF0.pir - 749 FUN_004725-T1 - 744 0.0 1199.6 0.0 1 1 0.0 0.0 1199.4 0.0 21 748 7 743 1 744 0.98 - 0.99 0.97 0.99 0.97 FORMATE_ACETYLTRANSFERASE_1 PTHR30191.SF0.pir - 749 FUN_004737-T1 - 744 0.0 1199.6 0.0 1 1 0.0 0.0 1199.4 0.0 21 748 7 743 1 744 0.98 - 0.99 0.97 0.99 0.97 FORMATE_ACETYLTRANSFERASE_1 PTHR30191.SF0.pir - 749 FUN_006165-T1 - 744 0.0 1196.1 0.0 1 1 0.0 0.0 1195.9 0.0 22 748 8 743 3 744 0.98 - 0.99 0.97 0.99 0.97 FORMATE_ACETYLTRANSFERASE_1 PTHR11615.SF304.pir - 431 FUN_004893-T1 - 1202 5.6e-80 272.2 0.0 1 2 0.0045 19.0 6.9 0.8 44 65 219 257 147 271 0.7 - 0.03 0.05 0.63 0.89 IRON_HYDROGENASE,_PUTATIVE-RELATED PTHR11615.SF304.pir - 431 FUN_005208-T1 - 1444 5.2e-84 285.5 0.0 1 4 1.1e-06 0.0045 18.8 5.8 44 75 288 340 187 347 0.72 - 0.04 0.07 0.38 0.86 IRON_HYDROGENASE,_PUTATIVE-RELATED PTHR11780.SF10.pir - 449 FUN_001132-T1 - 414 4.3e-83 282.2 0.0 1 2 1.8e-53 3.6e-49 170.4 0.0 67 210 2 180 1 194 0.97 - 0.43 0.32 0.74 0.59 NADH_DEHYDROGENASE_[UBIQUINONE]_FLAVOPROTEIN_1,_MITOCHONDRIAL PTHR10371.orig.30.pir - 230 FUN_002223-T1 - 208 7.2e-51 175.7 0.0 1 1 2.6e-55 7.4e-51 175.7 0.0 29 183 8 166 2 208 0.88 - 0.76 0.67 0.76 0.67 NADH_DEHYDROGENASE__UBIQUINONE__FLAVOPROTEIN_2,_MITOCHONDRIAL PTHR11780.orig.30.pir - 449 FUN_000599-T1 - 466 8.9e-152 508.6 0.0 1 1 7.2e-156 1.1e-151 508.2 0.0 14 430 38 456 29 464 0.95 - 0.90 0.93 0.90 0.93 NADH-UBIQUINONE_OXIDOREDUCTASE_FLAVOPROTEIN_1__NDUFV1 PTHR23406.SF32.pir - 596 FUN_001722-T1 - 585 2.8e-186 623.3 0.0 1 1 6.5e-190 3.4e-186 623.1 0.0 43 591 20 585 2 585 0.93 - 0.97 0.92 0.97 0.92 NADP-DEPENDENT_MALIC_ENZYME PTHR23406.SF32.pir - 596 FUN_003026-T1 - 563 4.2e-195 652.5 0.0 1 1 9.7e-199 5.1e-195 652.2 0.0 39 593 11 552 2 555 0.96 - 0.96 0.93 0.96 0.93 NADP-DEPENDENT_MALIC_ENZYME PTHR23406.SF32.pir - 596 FUN_003030-T1 - 563 4.2e-195 652.5 0.0 1 1 9.7e-199 5.1e-195 652.2 0.0 39 593 11 552 2 555 0.96 - 0.96 0.93 0.96 0.93 NADP-DEPENDENT_MALIC_ENZYME PTHR32154.SF0.pir - 1159 FUN_000496-T1 - 1228 0.0 1568.0 0.0 1 1 0.0 0.0 1567.6 0.0 3 1156 51 1226 48 1227 0.98 - 0.96 1.00 0.96 1.00 PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED PTHR32154.SF0.pir - 1159 FUN_000974-T1 - 1261 0.0 1402.3 0.0 1 1 0.0 0.0 1402.0 0.0 5 1154 81 1256 78 1261 0.97 - 0.93 0.99 0.93 0.99 PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED PTHR32154.SF0.pir - 1159 FUN_003906-T1 - 1427 0.0 1268.7 0.0 1 1 0.0 0.0 1264.6 0.0 5 1156 101 1383 98 1394 0.95 - 0.90 0.99 0.90 0.99 PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED PTHR32154.SF0.pir - 1159 FUN_004094-T1 - 1404 0.0 1481.8 0.0 1 1 0.0 0.0 1481.4 0.0 4 1154 3 1223 1 1228 0.96 - 0.87 0.99 0.87 0.99 PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED PTHR11117.SF2.pir - 320 FUN_001614-T1 - 326 6.7e-140 468.5 4.9 1 1 1.1e-143 7.9e-140 468.2 4.9 21 318 25 319 6 321 0.96 - 0.90 0.93 0.90 0.93 SUCCINATE--COA_LIGASE_[ADP/GDP-FORMING]_SUBUNIT_ALPHA,_MITOCHONDRIAL PTHR11815.orig.30.pir - 402 FUN_002952-T1 - 408 3.8e-159 532.8 0.0 1 1 3.9e-163 4.6e-159 532.5 0.0 5 399 7 405 4 408 0.98 - 0.98 0.98 0.98 0.98 SUCCINYL-COA_SYNTHETASE_BETA_CHAIN PTHR42714.SF6.pir - 467 FUN_004742-T1 - 491 6.2e-29 104.5 0.0 1 1 5e-31 4.2e-27 98.5 0.0 70 441 18 413 15 434 0.71 - 0.81 0.80 0.81 0.80 TRANSLATION_INITIATION_FACTOR_IF-2
To verify that each of these proteins are truly orthologs of the PANTHER families, we'd have to make trees, but this is a good starting point.
NOTE that we purposefully started by comparing our proteins to the entire PANTHER HMM-profile database, rather than starting out from a selection of HMM-profiles of gene families that we know we are interested in. This prevents false positive associations between our protein and and HMM profile:
Let's say that our genome doesn't have an ortholog of PANTHER family A, but it does have a distant paralog of that family (represented in the PANTHER database by family B). We are interested in A, but not in B. If we search with PANTHER family A, but not with family B, we may get a significant hit against family A, even though its not a true ortholog. If we were to include family B in our search as well, then we'd get significant hits against family A and B, but the score against B would be much better. It is thus clear that our protein is not a true ortholog of family A.

