This is an old revision of the document!

Finding genes of interest with PANTHER HMMs

By Joran Martijn (Sep 2023)

The PANTHER database

The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. The goal of PANTHER is to classify proteins by function.

The PANTHER initiative started in 1998 and was the first online database that included protein phylogenetic trees. Initially, very few species had whole genome sequences available but starting with version 6 (released in 2006), many more species were added. PANTHER16 (or 17?) included 142 fully sequenced genomes, including 19 vertebrates, 15 invertebrates, 14 fungi, 40 plants, 11 other eukaryotes, 8 archaea and 35 bacteria. The list is thus fairly limited and, more focused towards non-protist eukaryotes.

In 2002, PANTHER3 had 271779 proteins over 6155 families. In 2020, PANTHER16 had 2063337 proteins over 15635 families.

PANTHER families are defined as clusters of related proteins for which a good multiple sequence alignment can be made (defined in their original 2003 paper ). The emphasis is thus not so much on trying to reconstruct exact, complete orthogroups, but rather groupings that are useful to predict protein function on new protein sequences.

A key aspect of the PANTHER database is the division of families into multiple subfamilies. PANTHER's focus is on function and to a lesser degree evolutionary history, hence they attempt to classify proteins with distinct functions into distinct subfamilies. The idea is thus that each different subfamily has a distinct function.

The splitting of families into subfamilies was initially I think mostly done manually by trained biologists. However, in the most recent versions, it seems to be a mostly automated process but with the results still curated by trained biologists. In the 2021 paper a subfamily is defined as follows

A subfamily roughly corresponds to a group of least diverged orthologs, 
in that most members of each subfamily are mutually least diverged orthologs 
of all other members of the subfamily

I find this definition a bit cryptic, but it becomes a bit clearer with the visual example that they provide in the 2021 paper.

Here, sequences from distinct subfamilies are given a different color (blue, red, brown, pink, green). Speciation events are marked with green circles, duplication events with orange circles. Triangles are collapsed nodes, and diamonds indicate the root nodes of distinct subfamilies.

What becomes immediately apparent is that subfamilies are not necessary monophyletic! For example, the blue subfamily are polyphyletic (or paraphyletic?).

This is because of a duplication event that happened in the bottom daughter of the global root node. Two branches flow from the duplication event: a shorter one and a longer one. In PANTHERs definition, the sequences emerging from the shorter one comprise the least diverged orthologs (orthologs relative to the DANRE/ORYLA lineage on the top of the figure), and thus they belong to the same subfamily as the DANRE/ORYLA proteins. The sequences emerging from the longer one are thus NOT the least diverged, and hence they belong to a distinct subfamily.

Thus, any time a duplication event happens, a new subfamily is spawned. The exception is when the duplication leads to in-paralogs, or technically more specifically, to within-genome paralogs. That is why the duplication to the ORYLA and ORNAN copies did not spawn new subfamilies. The implication here is thus that after a duplication, one of the copies evolves to assume a new functional role.

Each family and subfamily in the database has associated with it a MSA, a tree and an HMM. The HMMs are useful to detect orthologs of these families and subfamilies in your pet genome. It is also useful to compare the E-values / bit-scores between families and subfamilies. If a novel predicted protein from your pet genome scores more highly against a family than a subfamily, it may indicate that your protein represents a novel subfamily.

PANTHER on Perun

Currently, the PANTHER17 database is stored on Perun here /scratch3/rogerlab_databases/other_dbs/PANTHER17.0/. It contains two subdirectories, books/ and globals/. The globals directory contains the actual database of HMMs, binHmm.* files, and names.tab, a file containing the actual annotation of each family and subfamily. Let's take a closer look:

PTHR10000.mag.mod       PHOSPHOSERINE PHOSPHATASE
PTHR10000.SF55.mod      5-AMINO-6-(5-PHOSPHO-D-RIBITYLAMINO)URACIL PHOSPHATASE YCSE
PTHR10000.SF23.mod      5-AMINO-6-(5-PHOSPHO-D-RIBITYLAMINO)URACIL PHOSPHATASE YITU
PTHR10000.SF57.mod      KANOSAMINE-6-PHOSPHATE PHOSPHATASE
...                     ...
PTHR10000.SF56.mod      MANNOSYL-3-PHOSPHOGLYCERATE PHOSPHATASE
PTHR10003.mag.mod       SUPEROXIDE DISMUTASE  CU-ZN -RELATED
PTHR10003.SF95.mod      SUPEROXIDE DISMUTASE [CU-ZN] 2, CHLOROPLASTIC
PTHR10003.SF82.mod      SOD_CU DOMAIN-CONTAINING PROTEIN
PTHR10003.SF94.mod      EXTRACELLULAR SUPEROXIDE DISMUTASE [CU-ZN] 3
PTHR10003.SF31.mod      SUPEROXIDE DISMUTASE [CU-ZN] 3
...                     ...

Those ending with .mag.mod are families, and those ending with .SF[number].mod are their respective subfamilies.

The Light_panther17.tsv file is a simple list of all families.

Searching your genome for PANTHER families homologs

Do an hmmscan

Simply do an hmmscan from the HMMER3 package. hmmscan takes as a query a single or a list of protein sequences and compares them with a database of HMM profiles. Below, we are taking the entire PANTHER17 database of HMMs as a database.

hmmscan \
   -E 0.001 --incE 0.0001 \
   --domE 0.01 --incdomE 0.01 \
   --notextw --cpu 10 \
   --tblout ergo_vs_panther17.tblout --domtblout ergo_vs_panther17.domtblout
   binHmm Ergobibamus_cyprinoides_CL.proteins.fa

This will yield two outfiles, the ergo_vs_panther17.tblout and ergo_vs_panther17.domtblout.

The .tblout file is a tabular output file where each line has a unique query (a protein sequence) and target (the HMM profile) combination. The overall E-value and score between HMM and sequence are reported (“full sequence”), as well as the segment of the protein that scored the highest with said HMM (“best 1 domain”), as well as a bunch of metrics for estimating the number of “domains”.

#                                                                --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name         accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
# ------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
PTHR31752.SF18.pir    -          FUN_000001-T1        -            6.8e-23   83.9   5.0   3.3e-12   48.7   2.4   2.1   1   1   1   2   2   2   2 -
PTHR36838.orig.30.pir -          FUN_000001-T1        -            5.6e-22   81.0  13.0     7e-22   80.7   1.8   2.2   2   0   0   2   2   2   2 -
PTHR36838.SF1.pir     -          FUN_000001-T1        -            1.9e-19   72.8   4.4   1.9e-19   72.8   4.4   2.4   2   0   0   2   2   1   1 -
PTHR31752.orig.30.pir -          FUN_000001-T1        -            3.3e-16   62.0   4.6   2.6e-14   55.7   4.6   2.2   1   1   0   1   1   1   1 -
PTHR36838.SF3.pir     -          FUN_000001-T1        -            1.4e-12   50.3   3.7   2.6e-12   49.3   0.1   2.2   2   0   0   2   2   1   1 -
PTHR31752.SF40.pir    -          FUN_000001-T1        -            1.2e-11   47.3   0.0     2e-11   46.6   0.0   1.3   1   0   0   1   1   1   1 -
PTHR31752.SF2.pir     -          FUN_000001-T1        -            5.5e-09   38.3   0.0     7e-09   38.0   0.0   1.2   1   0   0   1   1   1   1 -
PTHR31752.SF49.pir    -          FUN_000001-T1        -            3.4e-08   35.9   0.0   5.9e-08   35.2   0.0   1.4   1   0   0   1   1   1   1 -
...

NOTE that the outfile uses the word “domain” because traditionally HMM profiles describe protein domains. The Pfam is a good example of a database of domain HMM profiles. PANTHER HMM describes entire proteins. Hence the term 'domain' is not very well suited here, and can be a bit confusing.

NOTE also that the names of the families are slightly different here. Instead of ending with .mag.mod or .SF[number].mod, they end with .orig.30.pir and .SF[number].pir.

A protein sequence may have multiple segments that each match well with a particular HMM profile independently. To see each of these alignments separately, we can inspect the .domtblout file:

#                                                                             --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name         accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
# ------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
PTHR31752.SF18.pir    -            418 FUN_000001-T1        -            412   6.8e-23   83.9   5.0   1   2   1.9e-16   3.3e-12   48.7   2.4     9   167    10   174     4   201 0.78 -
PTHR31752.SF18.pir    -            418 FUN_000001-T1        -            412   6.8e-23   83.9   5.0   2   2   1.4e-12   2.4e-08   35.9   0.0   261   369   210   350   182   404 0.72 -
PTHR36838.orig.30.pir -            308 FUN_000001-T1        -            412   5.6e-22   81.0  13.0   1   2     4e-26     7e-22   80.7   1.8    10   214    15   272     9   281 0.82 -
PTHR36838.orig.30.pir -            308 FUN_000001-T1        -            412   5.6e-22   81.0  13.0   2   2    0.0028        50    5.6   3.8   222   276   310   367   301   398 0.80 -
PTHR36838.SF1.pir     -            308 FUN_000001-T1        -            412   1.9e-19   72.8   4.4   1   1   1.1e-23   1.9e-19   72.8   4.4     9   211    14   269     8   276 0.82 -
PTHR31752.orig.30.pir -            405 FUN_000001-T1        -            412   3.3e-16   62.0   4.6   1   1   1.4e-18   2.6e-14   55.7   4.6    10   340    10   330     3   402 0.73 -
PTHR36838.SF3.pir     -            312 FUN_000001-T1        -            412   1.4e-12   50.3   3.7   1   1   1.5e-16   2.6e-12   49.3   0.1    10   216    16   272    10   279 0.77 -
PTHR31752.SF40.pir    -            360 FUN_000001-T1        -            412   1.2e-11   47.3   0.0   1   1   1.1e-15     2e-11   46.6   0.0    37   267    36   274     8   282 0.75 -
PTHR31752.SF2.pir     -            349 FUN_000001-T1        -            412   5.5e-09   38.3   0.0   1   1     4e-13     7e-09   38.0   0.0   103   284   105   332    60   400 0.70 -
...

Now, each line represents a unique “domain” (i.e. segment of a protein sequence) and HMM profile combination. The overall E-value and score between HMM and the full sequence is still reported, as well as the scores of the specific “domains” that were found for each sequence.

Add HMM profile coverage information

You may want to reduce the number of false positives (i.e. protein sequences that match with an HMM but are not really orthologs) by discarding queries that only have partial alignments with the HMM. With BLAST I like to use the fields qcovhsp, qcovs, which describe how much of the query was covered by the HSP (High scoring Segment Pair, or in simpler terms, the alignment between query and hit) (qcovhsp) or collection of alignments between the same query and hit (qcovs). For example, if an alignment covered less than, say 60% of the length of the query, I may consider it a false positive hit. Unfortunately, HMMER does not directly report such metrics, but it does report the metrics necessary to calculate them. I wrote a python script, extend_hmmer_domtblout.py that does exactly this:

extend_hmmer_domtblout.py -m hmmscan -i ergo_vs_panther17.domtblout -o ergo_vs_panther17.ext.domtblout

This yields a file that looks like this

TargetName             TargetAcc  TargetLen  QueryName      QueryAcc  QueryLen  seq-E-value  seq-Score  seq-Bias  dom-No  dom-Count  dom-c-Evalue  dom-i-Evalue  dom-Score  dom-Bias  hmm-From  hmm-To  ali-From  ali-To  env-From  env-To  mean-PP  DescriptionOfTarget  qcovhsp  tcovhsp  qcovs  tcovs
PTHR31752.SF18.pir     -          418        FUN_000001-T1  -         412       6.8e-23      83.9       5.0       1       2          1.9e-16       3.3e-12       48.7       2.4       9         167     10        174     4         201     0.78     -                    0.40     0.38     0.74   0.64
PTHR31752.SF18.pir     -          418        FUN_000001-T1  -         412       6.8e-23      83.9       5.0       2       2          1.4e-12       2.4e-08       35.9       0.0       261       369     210       350     182       404     0.72     -                    0.34     0.26     0.74   0.64
PTHR36838.orig.30.pir  -          308        FUN_000001-T1  -         412       5.6e-22      81.0       13.0      1       2          4e-26         7e-22         80.7       1.8       10        214     15        272     9         281     0.82     -                    0.63     0.67     0.77   0.84
PTHR36838.orig.30.pir  -          308        FUN_000001-T1  -         412       5.6e-22      81.0       13.0      2       2          0.0028        50.0          5.6        3.8       222       276     310       367     301       398     0.8      -                    0.14     0.18     0.77   0.84
PTHR36838.SF1.pir      -          308        FUN_000001-T1  -         412       1.9e-19      72.8       4.4       1       1          1.1e-23       1.9e-19       72.8       4.4       9         211     14        269     8         276     0.82     -                    0.62     0.66     0.62   0.66
PTHR31752.orig.30.pir  -          405        FUN_000001-T1  -         412       3.3e-16      62.0       4.6       1       1          1.4e-18       2.6e-14       55.7       4.6       10        340     10        330     3         402     0.73     -                    0.78     0.82     0.78   0.82
PTHR36838.SF3.pir      -          312        FUN_000001-T1  -         412       1.4e-12      50.3       3.7       1       1          1.5e-16       2.6e-12       49.3       0.1       10        216     16        272     10        279     0.77     -                    0.62     0.66     0.62   0.66
PTHR31752.SF40.pir     -          360        FUN_000001-T1  -         412       1.2e-11      47.3       0.0       1       1          1.1e-15       2e-11         46.6       0.0       37        267     36        274     8         282     0.75     -                    0.58     0.64     0.58   0.64
PTHR31752.SF2.pir      -          349        FUN_000001-T1  -         412       5.5e-09      38.3       0.0       1       1          4e-13         7e-09         38.0       0.0       103       284     105       332     60        400     0.7      -                    0.55     0.52     0.55   0.52
PTHR31752.SF49.pir     -          358        FUN_000001-T1  -         412       3.4e-08      35.9       0.0       1       1          3.3e-12       5.9e-08       35.2       0.0       39        309     38        319     6         370     0.72     -                    0.68     0.76     0.68   0.76
PTHR31913.orig.30.pir  -          655        FUN_000003-T1  -         778       1e-54        189.4      0.0       1       1          7.2e-59       2e-54         188.4      0.0       203       649     350       776     296       778     0.83     -                    0.55     0.68     0.55   0.68
PTHR31913.SF0.pir      -          659        FUN_000003-T1  -         778       8e-54        186.4      0.0       1       1          5.1e-58       1.4e-53       185.6      0.0       313       653     441       776     332       778     0.88     -                    0.43     0.52     0.43   0.52
PTHR31913.SF7.pir      -          619        FUN_000003-T1  -         778       7.6e-15      57.4       0.0       1       1          4.9e-19       1.4e-14       56.6       0.0       325       551     516       730     506       739     0.85     -                    0.28     0.37     0.28   0.37
...

The script does a few things. It reformats the header line, and it makes the whole table tab delimited, which makes it easier to parse later on. Most importantly, it adds 4 new columns, qcovhsp, tcovhsp, qcovs and tcovs.

qcovhsp        how much of the query (the protein sequence) is covered by the HMM:protein alignment
tcovhsp        how much of the target (the HMM profile) is covered by the HMM:protein alignment
qcovs          how much of the query (the protein sequence) is covered _in_total_ by all HMM:protein alignments between this HMM and this protein
tcovs          how much of the target (the HMM profile) is covered _in_total_ by all HMM:protein alignments between this HMM and this protein

Add PANTHER annotation

Knowing which families and subfamilies had hits in your genome is nice and all, but it's just a bunch of unintelligible identifiers. I wrote another script that links these identifiers with their actual annotations (this linking information is found in names.tab and adds the actual annotations to the .domtblout file

add_pthr_annotation_to_hmmer_out.py -i ergo_vs_panther17.ext.domtblout -n names.tab > ergo_vs_panther17.annot.ext.domtblout

The output file will look something like this

TargetName             TargetAcc  TargetLen  QueryName      QueryAcc  QueryLen  seq-E-value  seq-Score  seq-Bias  dom-No  dom-Count  dom-c-Evalue  dom-i-Evalue  dom-Score  dom-Bias  hmm-From  hmm-To  ali-From  ali-To  env-From  env-To  mean-PP  DescriptionOfTarget  qcovhsp  tcovhsp  qcovs  tcovs    annotation
PTHR31752.SF18.pir     -          418        FUN_000001-T1  -         412       6.8e-23      83.9       5.0       1       2          1.9e-16       3.3e-12       48.7       2.4       9         167     10        174     4         201     0.78     -                    0.40     0.38     0.74   0.64     AUXIN_EFFLUX_CARRIER_COMPONENT_3A
PTHR31752.SF18.pir     -          418        FUN_000001-T1  -         412       6.8e-23      83.9       5.0       2       2          1.4e-12       2.4e-08       35.9       0.0       261       369     210       350     182       404     0.72     -                    0.34     0.26     0.74   0.64     AUXIN_EFFLUX_CARRIER_COMPONENT_3A
PTHR36838.orig.30.pir  -          308        FUN_000001-T1  -         412       5.6e-22      81.0       13.0      1       2          4e-26         7e-22         80.7       1.8       10        214     15        272     9         281     0.82     -                    0.63     0.67     0.77   0.84     AUXIN_EFFLUX_CARRIER_FAMILY_PROTEIN
PTHR36838.orig.30.pir  -          308        FUN_000001-T1  -         412       5.6e-22      81.0       13.0      2       2          0.0028        50.0          5.6        3.8       222       276     310       367     301       398     0.8      -                    0.14     0.18     0.77   0.84     AUXIN_EFFLUX_CARRIER_FAMILY_PROTEIN
PTHR36838.SF1.pir      -          308        FUN_000001-T1  -         412       1.9e-19      72.8       4.4       1       1          1.1e-23       1.9e-19       72.8       4.4       9         211     14        269     8         276     0.82     -                    0.62     0.66     0.62   0.66     SLR1864_PROTEIN
PTHR31752.orig.30.pir  -          405        FUN_000001-T1  -         412       3.3e-16      62.0       4.6       1       1          1.4e-18       2.6e-14       55.7       4.6       10        340     10        330     3         402     0.73     -                    0.78     0.82     0.78   0.82     AUXIN_EFFLUX_CARRIER_COMPONENT_1B-RELATED
PTHR36838.SF3.pir      -          312        FUN_000001-T1  -         412       1.4e-12      50.3       3.7       1       1          1.5e-16       2.6e-12       49.3       0.1       10        216     16        272     10        279     0.77     -                    0.62     0.66     0.62   0.66     AUXIN_EFFLUX_CARRIER_FAMILY_PROTEIN
PTHR31752.SF40.pir     -          360        FUN_000001-T1  -         412       1.2e-11      47.3       0.0       1       1          1.1e-15       2e-11         46.6       0.0       37        267     36        274     8         282     0.75     -                    0.58     0.64     0.58   0.64     AUXIN_EFFLUX_CARRIER_COMPONENT_8
PTHR31752.SF2.pir      -          349        FUN_000001-T1  -         412       5.5e-09      38.3       0.0       1       1          4e-13         7e-09         38.0       0.0       103       284     105       332     60        400     0.7      -                    0.55     0.52     0.55   0.52     AUXIN_EFFLUX_CARRIER_COMPONENT_5
PTHR31752.SF49.pir     -          358        FUN_000001-T1  -         412       3.4e-08      35.9       0.0       1       1          3.3e-12       5.9e-08       35.2       0.0       39        309     38        319     6         370     0.72     -                    0.68     0.76     0.68   0.76     AUXIN_EFFLUX_CARRIER_COMPONENT
PTHR31913.orig.30.pir  -          655        FUN_000003-T1  -         778       1e-54        189.4      0.0       1       1          7.2e-59       2e-54         188.4      0.0       203       649     350       776     296       778     0.83     -                    0.55     0.68     0.55   0.68     VACUOLAR_IMPORT_AND_DEGRADATION_PROTEIN_27
PTHR31913.SF0.pir      -          659        FUN_000003-T1  -         778       8e-54        186.4      0.0       1       1          5.1e-58       1.4e-53       185.6      0.0       313       653     441       776     332       778     0.88     -                    0.43     0.52     0.43   0.52     VACUOLAR_IMPORT_AND_DEGRADATION_PROTEIN_27
PTHR31913.SF7.pir      -          619        FUN_000003-T1  -         778       7.6e-15      57.4       0.0       1       1          4.9e-19       1.4e-14       56.6       0.0       325       551     516       730     506       739     0.85     -                    0.28     0.37     0.28   0.37     DEM_PROTEIN
PTHR31913.SF5.pir      -          613        FUN_000003-T1  -         778       2.3e-12      49.0       0.1       1       1          5.5e-16       1.5e-11       46.2       0.1       231       539     442       731     434       740     0.78     -                    0.37     0.50     0.37   0.50     VID27_DOMAIN-CONTAINING_PROTEIN
PTHR31913.SF2.pir      -          622        FUN_000003-T1  -         778       4.1e-08      35.1       0.0       1       1          1.4e-11       3.9e-07       31.9       0.0       325       553     516       729     506       738     0.77     -                    0.28     0.37     0.28   0.37     OS02G0789600_PROTEIN
PTHR23359.SF250.pir    -          270        FUN_000004-T1  -         520       7.6e-74      251.2      0.0       1       2          3.2e-74       7.2e-71       241.4      0.0       59        265     48        262     21        267     0.93     -                    0.41     0.77     0.45   0.77     KINASE,_PUTATIVE-RELATED
...

Filter by HMM profile coverage and composition bias

You may only want to keep proteins of your pet genome that are (nearly) fully covered by PANTHER (sub)families to reduce the number of false positives.

I did this by only keeping lines of the .domtblout where the tcovs (the coverage of the target accumulated over all protein:HMM-profile alignments; the 27th column) is at least 50% of the HMM-profile length, and the seq-Bias score (9th column) is less than 90% of the total bit seq-Score (8th column).

After that, the table is sorted by protein sequence identifier and only the best hit, according to seq-Score remains.

function body {
    IFS= read -r header
    printf '%s\n' "$header"
    "$@"
}

cat ergo_vs_panther17.annot.ext.domtblout \
    | body awk '$27>=0.50 && $9/$8<0.9' \
    | body sort -k4,4 -k8,8nr \
    | body sort -k4,4 -u \
    > ergo_vs_panther17.filt.ext.domtblout

The body function is a nifty trick to only apply your awk or sort operations on the body of the table and leave the header intact.

Our new file looks as follows:

TargetName             TargetAcc  TargetLen  QueryName      QueryAcc  QueryLen  seq-E-value  seq-Score  seq-Bias  dom-No  dom-Count  dom-c-Evalue  dom-i-Evalue  dom-Score  dom-Bias  hmm-From  hmm-To  ali-From  ali-To  env-From  env-To  mean-PP  DescriptionOfTarget  qcovhsp  tcovhsp  qcovs  tcovs    annotation
PTHR31752.SF18.pir     -          418        FUN_000001-T1  -         412       6.8e-23      83.9       5.0       1       2          1.9e-16       3.3e-12       48.7       2.4       9         167     10        174     4         201     0.78     -                    0.40     0.38     0.74   0.64     AUXIN_EFFLUX_CARRIER_COMPONENT_3A
PTHR31913.orig.30.pir  -          655        FUN_000003-T1  -         778       1e-54        189.4      0.0       1       1          7.2e-59       2e-54         188.4      0.0       203       649     350       776     296       778     0.83     -                    0.55     0.68     0.55   0.68     VACUOLAR_IMPORT_AND_DEGRADATION_PROTEIN_27
PTHR23359.SF250.pir    -          270        FUN_000004-T1  -         520       7.6e-74      251.2      0.0       1       2          3.2e-74       7.2e-71       241.4      0.0       59        265     48        262     21        267     0.93     -                    0.41     0.77     0.45   0.77     KINASE,_PUTATIVE-RELATED
PTHR45991.orig.30.pir  -          450        FUN_000005-T1  -         544       9.4e-99      334.2      0.0       1       1          2e-101        1.3e-98       333.8      0.0       32        438     18        505     11        520     0.86     -                    0.90     0.90     0.90   0.90     PACHYTENE_CHECKPOINT_PROTEIN_2
PTHR13563.orig.30.pir  -          346        FUN_000006-T1  -         343       8.1e-52      179.5      0.0       1       1          3.6e-56       1e-51         179.1      0.0       63        321     43        320     20        343     0.79     -                    0.81     0.75     0.81   0.75     TRNA_(GUANINE-9-)_METHYLTRANSFERASE
PTHR11685.SF436.pir    -          285        FUN_000007-T1  -         589       2.4e-19      72.8       8.5       1       1          5.8e-23       2.6e-19       72.6       8.5       76        240     241       491     132       578     0.7      -                    0.43     0.58     0.43   0.58     RBR-TYPE_E3_UBIQUITIN_TRANSFERASE
PTHR46233.orig.30.pir  -          231        FUN_000011-T1  -         208       1.7e-48      167.7      0.0       1       1          5.2e-52       2.1e-48       167.4      0.0       20        214     11        206     3         207     0.94     -                    0.94     0.84     0.94   0.84     HYDROXYACYLGLUTATHIONE_HYDROLASE_GLOC
PTHR23102.SF19.pir     -          206        FUN_000013-T1  -         233       1e-31        113.1      9.9       1       1          1.1e-35       1.3e-31       112.7      9.9       58        172     38        151     31        186     0.9      -                    0.49     0.56     0.49   0.56     CLEAVAGE_AND_POLYADENYLATION_SPECIFICITY_FACTOR_SUBUNIT_4-LIKE_PROTEIN-RELATED
PTHR47979.SF50.pir     -          216        FUN_000014-T1  -         206       1.6e-105     353.9      0.0       1       1          8.5e-108      1.9e-105      353.7      0.0       1         193     1         193     1         204     0.98     -                    0.94     0.89     0.94   0.89     RAS-RELATED_PROTEIN_RAB-2B
PTHR17605.orig.30.pir  -          732        FUN_000016-T1  -         712       4.8e-175     586.8      0.0       1       1          1e-177        6.9e-175      586.3      0.0       151       729     100       710     92        712     0.92     -                    0.86     0.79     0.86   0.79     RIBOSOME_BIOGENESIS_PROTEIN_BOP1__BLOCK_OF_PROLIFERATION_1_PROTEIN
...

Each of the protein queries in this list has a significant hit against a PANTHER family, and all the alignments between this protein and the HMM-profile cumulatively cover at least 50% of the HMM-profile.

Selecting (sub)families of interest

Until now we have considered all predicted proteins of our pet genome. But perhaps we are only particularly interested for whatever reason in a subset of them.

For example, we want to know what kind of energy conservation genes are present in our genome. From studying the literature, I've compiled in families_of_interest.ATP.upd.list the following list of PANTHER families that seem to be involved in ATP synthesis. The table is formatted like “category long name”, “category short name”, “gene or protein name”, “PANTHER family identifier (names.tab)”, “PANTHER family annotation”, “PANTHER family identifier (binHmm database)”.

ENERGY_CONSERVATION  ATP  ME                           PTHR23406.mag.mod    MALIC_ENZYME-RELATED                                                  PTHR23406.orig.30.pir
ENERGY_CONSERVATION  ATP  ME                           PTHR23406.SF32.mod   NADP-DEPENDENT_MALIC_ENZYME                                           PTHR23406.SF32.pir
ENERGY_CONSERVATION  ATP  PNO/PFO                      PTHR32154.mag.mod    PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED                            PTHR32154.orig.30.pir
ENERGY_CONSERVATION  ATP  PNO/PFO                      PTHR32154.SF0.mod    PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED                            PTHR32154.SF0.pir
ENERGY_CONSERVATION  ATP  Adrenodoxin/Ferredoxin_Yah1  PTHR23426.mag.mod    FERREDOXIN/ADRENODOXIN                                                PTHR23426.orig.30.pir
ENERGY_CONSERVATION  ATP  Adrenodoxin/Ferredoxin_Yah1  PTHR23426.SF65.mod   FERREDOXIN_1-RELATED                                                  PTHR23426.SF65.pir
ENERGY_CONSERVATION  ATP  FdxR/Arh1                    PTHR11938.mag.mod    FAD_NADPH_DEHYDROGENASE/OXIDOREDUCTASE                                PTHR11938.orig.30.pir
ENERGY_CONSERVATION  ATP  FdxR/Arh1                    PTHR11938.SF91.mod   NADPH:ADRENODOXIN_OXIDOREDUCTASE,_MITOCHONDRIAL                       PTHR11938.SF91.pir
ENERGY_CONSERVATION  ATP  HydA                         PTHR11615.mag.mod    NITRATE,_FORMATE,_IRON_DEHYDROGENASE                                  PTHR11615.orig.30.pir
ENERGY_CONSERVATION  ATP  HydA-1                       PTHR11615.SF345.mod  CATALYTIC_SUBUNIT_OF_IRON-ONLY_HYDROGENASE-RELATED                    PTHR11615.SF345.pir
ENERGY_CONSERVATION  ATP  HydA-4                       PTHR11615.SF304.mod  IRON_HYDROGENASE,_PUTATIVE-RELATED                                    PTHR11615.SF304.pir
ENERGY_CONSERVATION  ATP  HydE                         PTHR43726.mag.mod    3-METHYLORNITHINE_SYNTHASE                                            PTHR43726.orig.30.pir
ENERGY_CONSERVATION  ATP  HydE                         PTHR43726.SF1.mod    [FEFE]_HYDROGENASE_MATURASE_SUBUNIT_HYDE                              PTHR43726.SF1.pir
ENERGY_CONSERVATION  ATP  HydF                         PTHR42714.mag.mod    TRNA_MODIFICATION_GTPASE_GTPBP3                                       PTHR42714.orig.30.pir
ENERGY_CONSERVATION  ATP  HydF                         PTHR42714.SF6.mod    TRANSLATION_INITIATION_FACTOR_IF-2                                    PTHR42714.SF6.pir
ENERGY_CONSERVATION  ATP  HydG                         PTHR43583.mag.mod    2-IMINOACETATE_SYNTHASE                                               PTHR43583.orig.30.pir
ENERGY_CONSERVATION  ATP  HydG                         PTHR43583.SF2.mod    BIOTIN_AND_THIAMIN_SYNTHESIS_ASSOCIATED_DOMAIN_CONTAINING_PROTEIN     PTHR43583.SF2.pir
ENERGY_CONSERVATION  ATP  NuoF/Complex_I_51kDa         PTHR11780.mag.mod    NADH-UBIQUINONE_OXIDOREDUCTASE_FLAVOPROTEIN_1_NDUFV1                  PTHR11780.orig.30.pir
ENERGY_CONSERVATION  ATP  NuoF/Complex_I_51kDa         PTHR11780.SF10.mod   NADH_DEHYDROGENASE_[UBIQUINONE]_FLAVOPROTEIN_1,_MITOCHONDRIAL         PTHR11780.SF10.pir
ENERGY_CONSERVATION  ATP  NuoE/Complex_I_24kDa         PTHR10371.mag.mod    NADH_DEHYDROGENASE_UBIQUINONE_FLAVOPROTEIN_2,_MITOCHONDRIAL           PTHR10371.orig.30.pir
ENERGY_CONSERVATION  ATP  NuoE/Complex_I_24kDa         PTHR10371.SF3.mod    NADH_DEHYDROGENASE_[UBIQUINONE]_FLAVOPROTEIN_2,_MITOCHONDRIAL         PTHR10371.SF3.pir
ENERGY_CONSERVATION  ATP  ASCT-1B                      PTHR21432.mag.mod    ACETYL-COA_HYDROLASE-RELATED                                          PTHR21432.orig.30.pir
ENERGY_CONSERVATION  ATP  ASCT-1B                      PTHR21432.SF20.mod   ACETYL-COA_DEACYLASE                                                  PTHR21432.SF20.pir
ENERGY_CONSERVATION  ATP  ASCT-1C                      PTHR43609.mag.mod    ACETYL-COA_HYDROLASE                                                  PTHR43609.orig.30.pir
ENERGY_CONSERVATION  ATP  ASCT-1C                      PTHR43609.SF1.mod    ACETYL-COA_HYDROLASE                                                  PTHR43609.SF1.pir
ENERGY_CONSERVATION  ATP  SCSa/STKa                    PTHR11117.mag.mod    SUCCINYL-COA_LIGASE_SUBUNIT_ALPHA                                     PTHR11117.orig.30.pir
ENERGY_CONSERVATION  ATP  SCSa/STKa                    PTHR11117.SF2.mod    SUCCINATE--COA_LIGASE_[ADP/GDP-FORMING]_SUBUNIT_ALPHA,_MITOCHONDRIAL  PTHR11117.SF2.pir
ENERGY_CONSERVATION  ATP  SCSb/STKb                    PTHR11815.mag.mod    SUCCINYL-COA_SYNTHETASE_BETA_CHAIN                                    PTHR11815.orig.30.pir
ENERGY_CONSERVATION  ATP  SCSb/STKb                    PTHR11815.SF10.mod   SUCCINATE--COA_LIGASE_[GDP-FORMING]_SUBUNIT_BETA,_MITOCHONDRIAL       PTHR11815.SF10.pir
ENERGY_CONSERVATION  ATP  DLDH                         PTHR43716.mag.mod    D-2-HYDROXYGLUTARATE_DEHYDROGENASE,_MITOCHONDRIAL                     PTHR43716.orig.30.pir
ENERGY_CONSERVATION  ATP  DLDH                         PTHR43716.SF1.mod    D-2-HYDROXYGLUTARATE_DEHYDROGENASE,_MITOCHONDRIAL                     PTHR43716.SF1.pir
ENERGY_CONSERVATION  ATP  CysJ-fused_HydA              PTHR19384.mag.mod    NITRIC_OXIDE_SYNTHASE-RELATED                                         PTHR19384.orig.30.pir
ENERGY_CONSERVATION  ATP  CysJ-fused_HydA              PTHR19384.SF109.mod  SULFITE_REDUCTASE_[NADPH]_FLAVOPROTEIN_COMPONENT                      PTHR19384.SF109.pir
ENERGY_CONSERVATION  ATP  CysJ-fused_HydA              PTHR19384.SF17.mod   NADPH--CYTOCHROME_P450_REDUCTASE                                      PTHR19384.SF17.pir
ENERGY_CONSERVATION  ATP  PFL                          PTHR30191.mag.mod    FORMATE_ACETYLTRANSFERASE                                             PTHR30191.orig.30.pir
ENERGY_CONSERVATION  ATP  PFL                          PTHR30191.SF0.mod    FORMATE_ACETYLTRANSFERASE_1                                           PTHR30191.SF0.pir
ENERGY_CONSERVATION  ATP  ACS                          PTHR42793.mag.mod    COA_BINDING_DOMAIN_CONTAINING_PROTEIN                                 PTHR42793.orig.30.pir
ENERGY_CONSERVATION  ATP  ACS                          PTHR42793.SF1.mod    PEPTIDYL-LYSINE_N-ACETYLTRANSFERASE_PATZ                              PTHR42793.SF1.pir
ENERGY_CONSERVATION  ATP  ACS                          PTHR43334.mag.mod    ACETATE--COA_LIGASE_[ADP-FORMING]                                     PTHR43334.orig.30.pir
ENERGY_CONSERVATION  ATP  ACS                          PTHR43334.SF1.mod    ACETATE--COA_LIGASE_[ADP-FORMING]                                     PTHR43334.SF1.pir
ENERGY_CONSERVATION  ATP  MM-CoA-epimerase             PTHR43048.mag.mod    METHYLMALONYL-COA_EPIMERASE                                           PTHR43048.orig.30.pir
ENERGY_CONSERVATION  ATP  MM-CoA-epimerase             PTHR43048.SF3.mod    METHYLMALONYL-COA_EPIMERASE,_MITOCHONDRIAL                            PTHR43048.SF3.pir
ENERGY_CONSERVATION  ATP  MM-CoA-mutase                PTHR48101.SF4.mod    METHYLMALONYL-COA_MUTASE,_MITOCHONDRIAL                               PTHR48101.SF4.pir
ENERGY_CONSERVATION  ATP  MM-CoA-mutase                PTHR48101.mag.mod    METHYLMALONYL-COA_MUTASE,_MITOCHONDRIAL-RELATED                       PTHR48101.orig.30.pir
ENERGY_CONSERVATION  ATP  MMSA-dehydrogenase           PTHR43866.mag.mod    MALONATE-SEMIALDEHYDE_DEHYDROGENASE                                   PTHR43866.orig.30.pir
ENERGY_CONSERVATION  ATP  MMSA-dehydrogenase           PTHR43866.SF3.mod    METHYLMALONATE-SEMIALDEHYDE_DEHYDROGENASE_[ACYLATING],_MITOCHONDRIAL  PTHR43866.SF3.pir
ENERGY_CONSERVATION  ATP  PCCa                         PTHR18866.mag.mod    CARBOXYLASE:PYRUVATE/ACETYL-COA/PROPIONYL-COA_CARBOXYLASE             PTHR18866.orig.30.pir
ENERGY_CONSERVATION  ATP  PCCa                         PTHR18866.SF33.mod   METHYLCROTONOYL-COA_CARBOXYLASE_SUBUNIT_ALPHA,_MITOCHONDRIAL-RELATED  PTHR18866.SF33.pir
ENERGY_CONSERVATION  ATP  PCCb                         PTHR43842.mag.mod    PROPIONYL-COA_CARBOXYLASE_BETA_CHAIN                                  PTHR43842.orig.30.pir
ENERGY_CONSERVATION  ATP  PCCb                         PTHR43842.SF2.mod    PROPIONYL-COA_CARBOXYLASE_BETA_CHAIN-RELATED                          PTHR43842.SF2.pir
ENERGY_CONSERVATION  ATP  PCCb                         PTHR45728.SF3.mod    ACETYL-COA_CARBOXYLASE                                                PTHR45728.SF3.pir
ENERGY_CONSERVATION  ATP  PCCb                         PTHR45728.mag.mod    ACETYL-COA_CARBOXYLASE,_ISOFORM_A                                     PTHR45728.orig.30.pir
ENERGY_CONSERVATION  ATP  3HB-dehydrogenase            PTHR43060.mag.mod    3-HYDROXYISOBUTYRATE_DEHYDROGENASE-LIKE_1,_MITOCHONDRIAL-RELATED      PTHR43060.orig.30.pir
ENERGY_CONSERVATION  ATP  3HB-dehydrogenase            PTHR43060.SF15.mod   3-HYDROXYISOBUTYRATE_DEHYDROGENASE-LIKE_1,_MITOCHONDRIAL-RELATED      PTHR43060.SF15.pir

NOTE that I've included for every gene the family and the subfamily identifier. So, which of these are in our genome?

cat ergo_vs_panther17.filt.ext.domtblout \
    | body rg -f <(cut -f 6 families_of_interest.ATP.upd.list) \
    | body sort -k28,28 \
    > ergo_vs_panther17.ATP.ext.domtblout

rg is a modern unix tool that is very similar to the classic grep (it is literally called ripgrep) but its a lot faster, particularly with the -f option.

We now get the following file

TargetName             TargetAcc  TargetLen  QueryName      QueryAcc  QueryLen  seq-E-value  seq-Score  seq-Bias  dom-No  dom-Count  dom-c-Evalue  dom-i-Evalue  dom-Score  dom-Bias  hmm-From  hmm-To  ali-From  ali-To  env-From  env-To  mean-PP  DescriptionOfTarget  qcovhsp  tcovhsp  qcovs  tcovs    annotation
PTHR43726.orig.30.pir  -          384        FUN_005298-T1  -         390       7.1e-100     337.2      0.0       1       1          8.8e-104      8.9e-100      336.9      0.0       45        380     18        380     5         383     0.94     -                    0.93     0.88     0.93   0.88     3-METHYLORNITHINE_SYNTHASE
PTHR43334.orig.30.pir  -          705        FUN_004883-T1  -         761       1.1e-247     826.9      0.0       1       1          1.3e-251      1.4e-247      826.6      0.0       4         702     31        733     29        736     0.98     -                    0.92     0.99     0.92   0.99     ACETATE--COA_LIGASE_[ADP-FORMING]
PTHR43334.orig.30.pir  -          705        FUN_004897-T1  -         647       1.4e-224     750.4      0.0       1       1          1.7e-228      1.8e-224      750.1      0.0       1         630     12        643     12        646     0.98     -                    0.98     0.89     0.98   0.89     ACETATE--COA_LIGASE_[ADP-FORMING]
PTHR43583.SF2.pir      -          498        FUN_005007-T1  -         570       6e-189       631.4      0.0       1       1          5.7e-193      7.3e-189      631.1      0.0       19        498     58        570     31        570     0.97     -                    0.90     0.96     0.90   0.96     BIOTIN_AND_THIAMIN_SYNTHESIS_ASSOCIATED_DOMAIN_CONTAINING_PROTEIN
PTHR43583.SF2.pir      -          498        FUN_005031-T1  -         570       7.6e-189     631.1      0.0       1       1          7.2e-193      9.3e-189      630.8      0.0       19        498     58        570     31        570     0.97     -                    0.90     0.96     0.90   0.96     BIOTIN_AND_THIAMIN_SYNTHESIS_ASSOCIATED_DOMAIN_CONTAINING_PROTEIN
PTHR11615.SF345.pir    -          585        FUN_003250-T1  -         583       5.5e-167     559.9      0.2       1       1          1.1e-170      6.3e-167      559.7      0.2       6         581     5         579     1         582     0.94     -                    0.99     0.98     0.99   0.98     CATALYTIC_SUBUNIT_OF_IRON-ONLY_HYDROGENASE-RELATED
PTHR11615.SF345.pir    -          585        FUN_003742-T1  -         461       3.9e-127     428.3      1.4       1       1          5.8e-131      6.9e-127      427.5      1.4       139       580     31        460     4         461     0.92     -                    0.93     0.76     0.93   0.76     CATALYTIC_SUBUNIT_OF_IRON-ONLY_HYDROGENASE-RELATED
PTHR42793.orig.30.pir  -          472        FUN_000469-T1  -         843       1.5e-104     353.0      0.0       1       1          5.9e-109      1.7e-104      352.9      0.0       5         460     352       842     146       843     0.93     -                    0.58     0.97     0.58   0.97     COA_BINDING_DOMAIN_CONTAINING_PROTEIN
PTHR23426.SF65.pir     -          153        FUN_000689-T1  -         116       1.5e-13      54.3       0.0       1       1          1.7e-17       1.7e-13       54.1       0.0       57        142     28        109     14        115     0.87     -                    0.71     0.56     0.71   0.56     FERREDOXIN_1-RELATED
PTHR23426.SF65.pir     -          153        FUN_001937-T1  -         106       2.1e-13      53.8       0.1       1       1          1.9e-17       2.4e-13       53.6       0.1       46        139     8         99      3         105     0.81     -                    0.87     0.61     0.87   0.61     FERREDOXIN_1-RELATED
PTHR30191.SF0.pir      -          749        FUN_003588-T1  -         742       0.0          1182.9     0.1       1       1          0.0           0.0           1182.7     0.1       20        748     4         741     1         742     0.98     -                    0.99     0.97     0.99   0.97     FORMATE_ACETYLTRANSFERASE_1
PTHR30191.SF0.pir      -          749        FUN_004683-T1  -         779       0.0          1200.2     0.0       1       1          0.0           0.0           1199.9     0.0       15        748     36        778     26        779     0.98     -                    0.95     0.98     0.95   0.98     FORMATE_ACETYLTRANSFERASE_1
PTHR30191.SF0.pir      -          749        FUN_004725-T1  -         744       0.0          1199.6     0.0       1       1          0.0           0.0           1199.4     0.0       21        748     7         743     1         744     0.98     -                    0.99     0.97     0.99   0.97     FORMATE_ACETYLTRANSFERASE_1
PTHR30191.SF0.pir      -          749        FUN_004737-T1  -         744       0.0          1199.6     0.0       1       1          0.0           0.0           1199.4     0.0       21        748     7         743     1         744     0.98     -                    0.99     0.97     0.99   0.97     FORMATE_ACETYLTRANSFERASE_1
PTHR30191.SF0.pir      -          749        FUN_006165-T1  -         744       0.0          1196.1     0.0       1       1          0.0           0.0           1195.9     0.0       22        748     8         743     3         744     0.98     -                    0.99     0.97     0.99   0.97     FORMATE_ACETYLTRANSFERASE_1
PTHR11615.SF304.pir    -          431        FUN_004893-T1  -         1202      5.6e-80      272.2      0.0       1       2          0.0045        19.0          6.9        0.8       44        65      219       257     147       271     0.7      -                    0.03     0.05     0.63   0.89     IRON_HYDROGENASE,_PUTATIVE-RELATED
PTHR11615.SF304.pir    -          431        FUN_005208-T1  -         1444      5.2e-84      285.5      0.0       1       4          1.1e-06       0.0045        18.8       5.8       44        75      288       340     187       347     0.72     -                    0.04     0.07     0.38   0.86     IRON_HYDROGENASE,_PUTATIVE-RELATED
PTHR11780.SF10.pir     -          449        FUN_001132-T1  -         414       4.3e-83      282.2      0.0       1       2          1.8e-53       3.6e-49       170.4      0.0       67        210     2         180     1         194     0.97     -                    0.43     0.32     0.74   0.59     NADH_DEHYDROGENASE_[UBIQUINONE]_FLAVOPROTEIN_1,_MITOCHONDRIAL
PTHR10371.orig.30.pir  -          230        FUN_002223-T1  -         208       7.2e-51      175.7      0.0       1       1          2.6e-55       7.4e-51       175.7      0.0       29        183     8         166     2         208     0.88     -                    0.76     0.67     0.76   0.67     NADH_DEHYDROGENASE__UBIQUINONE__FLAVOPROTEIN_2,_MITOCHONDRIAL
PTHR11780.orig.30.pir  -          449        FUN_000599-T1  -         466       8.9e-152     508.6      0.0       1       1          7.2e-156      1.1e-151      508.2      0.0       14        430     38        456     29        464     0.95     -                    0.90     0.93     0.90   0.93     NADH-UBIQUINONE_OXIDOREDUCTASE_FLAVOPROTEIN_1__NDUFV1
PTHR23406.SF32.pir     -          596        FUN_001722-T1  -         585       2.8e-186     623.3      0.0       1       1          6.5e-190      3.4e-186      623.1      0.0       43        591     20        585     2         585     0.93     -                    0.97     0.92     0.97   0.92     NADP-DEPENDENT_MALIC_ENZYME
PTHR23406.SF32.pir     -          596        FUN_003026-T1  -         563       4.2e-195     652.5      0.0       1       1          9.7e-199      5.1e-195      652.2      0.0       39        593     11        552     2         555     0.96     -                    0.96     0.93     0.96   0.93     NADP-DEPENDENT_MALIC_ENZYME
PTHR23406.SF32.pir     -          596        FUN_003030-T1  -         563       4.2e-195     652.5      0.0       1       1          9.7e-199      5.1e-195      652.2      0.0       39        593     11        552     2         555     0.96     -                    0.96     0.93     0.96   0.93     NADP-DEPENDENT_MALIC_ENZYME
PTHR32154.SF0.pir      -          1159       FUN_000496-T1  -         1228      0.0          1568.0     0.0       1       1          0.0           0.0           1567.6     0.0       3         1156    51        1226    48        1227    0.98     -                    0.96     1.00     0.96   1.00     PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED
PTHR32154.SF0.pir      -          1159       FUN_000974-T1  -         1261      0.0          1402.3     0.0       1       1          0.0           0.0           1402.0     0.0       5         1154    81        1256    78        1261    0.97     -                    0.93     0.99     0.93   0.99     PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED
PTHR32154.SF0.pir      -          1159       FUN_003906-T1  -         1427      0.0          1268.7     0.0       1       1          0.0           0.0           1264.6     0.0       5         1156    101       1383    98        1394    0.95     -                    0.90     0.99     0.90   0.99     PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED
PTHR32154.SF0.pir      -          1159       FUN_004094-T1  -         1404      0.0          1481.8     0.0       1       1          0.0           0.0           1481.4     0.0       4         1154    3         1223    1         1228    0.96     -                    0.87     0.99     0.87   0.99     PYRUVATE-FLAVODOXIN_OXIDOREDUCTASE-RELATED
PTHR11117.SF2.pir      -          320        FUN_001614-T1  -         326       6.7e-140     468.5      4.9       1       1          1.1e-143      7.9e-140      468.2      4.9       21        318     25        319     6         321     0.96     -                    0.90     0.93     0.90   0.93     SUCCINATE--COA_LIGASE_[ADP/GDP-FORMING]_SUBUNIT_ALPHA,_MITOCHONDRIAL
PTHR11815.orig.30.pir  -          402        FUN_002952-T1  -         408       3.8e-159     532.8      0.0       1       1          3.9e-163      4.6e-159      532.5      0.0       5         399     7         405     4         408     0.98     -                    0.98     0.98     0.98   0.98     SUCCINYL-COA_SYNTHETASE_BETA_CHAIN
PTHR42714.SF6.pir      -          467        FUN_004742-T1  -         491       6.2e-29      104.5      0.0       1       1          5e-31         4.2e-27       98.5       0.0       70        441     18        413     15        434     0.71     -                    0.81     0.80     0.81   0.80     TRANSLATION_INITIATION_FACTOR_IF-2

To verify that each of these proteins are truly orthologs of the PANTHER families, we'd have to make trees, but this is a good starting point.

NOTE that we purposefully started by comparing our proteins to the entire PANTHER HMM-profile database, rather than starting out from a selection of HMM-profiles of gene families that we know we are interested in. This prevents false positive associations between our protein and and HMM profile:

Let's say that our genome doesn't have an ortholog of PANTHER family A, but it does have a distant paralog of that family (represented in the PANTHER database by family B). We are interested in A, but not in B. If we search with PANTHER family A, but not with family B, we may get a significant hit against family A, even though its not a true ortholog. If we were to include family B in our search as well, then we'd get significant hits against family A and B, but the score against B would be much better. It is thus clear that our protein is not a true ortholog of family A.

cgeb2001's DokuWiki!

Table of Contents