
Curation of phylogenomic datasets

Phylogenomic analyses attempt to use genomic data to answer phylogenetic questions. Often we're asking about the shape of a species tree. How did modern day taxa diverge over their evolutionary history? What is the deepest divergence (i.e. the root) of these taxa?

Construction

To investigate these questions, we often want to construct a so-called phylogenomic dataset. In essence, this is a collection of genes that are present in the taxa of interest and that are informative about their species tree. For example, Munoz-Gomez et al. 2022 used a dataset of 108 genes present in 115 Proteobacteria and mitochondria to investigate the placement of mitochondria relative to Alphaproteobacteria.

For a set of genes to be informative about the species tree in question, their evolutionary history should ideally match that of the species tree. These genes should therefore be descendants of an ancestral gene that was present either in the last common ancestor of the species tree, i.e. its root (henceforth the LCA), or in one of its descendant nodes, and that has evolved purely vertically since. Note that if the ancestral gene was newly introduced to this clade via horizontal gene transfer, it may still be informative, as long as it evolved purely vertically after its introduction. On a special note, some methods to reconstruct species trees do allow for horizontal gene transfers (e.g. PHYLDOG) and can thus use a lot more genes, but such methods are currently still computationally intractable.

Technically speaking, our set of genes should comprise an orthogroup, defined by Toni Gabaldon & Eugene Koonin (2013) as (paraphrasing here) the set of present-day genes that have descended from a single ancestral gene that was present in the LCA or one of its descendant nodes. Within an orthogroup, each possible pair of genes is one of the following:

a pair of orthologs (i.e., their last common ancestor was a speciation event)
a pair of in-paralogs (i.e., their last common ancestor was a duplication event). The in- prefix indicates that the duplication event happened after the divergence of the LCA.
a pair of in-paralogs is in turn a pair of co-orthologs relative to a third gene, if the last common ancestor of the three was a speciation event.
a pair of in-xenologs *, if one of the pair has undergone horizontal gene transfer at some point in its evolutionary history since its divergence from the other, and the pair's common ancestral gene was present in the LCA or one of its descendants.
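
The ortholog / in-paralog distinction above can be made concrete with a toy gene tree in which every internal node is labelled as a speciation or a duplication. The following is a minimal sketch under stated assumptions: the `Node` class, the event labels and the example tree are hypothetical, and xenologs are left out for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a toy reconciled gene tree (hypothetical structure)."""
    name: str                      # must be unique within the tree
    event: Optional[str] = None    # "speciation" or "duplication"; None for leaves
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> List[Node]:
    """Return the nodes from the root down to the named leaf ([] if absent)."""
    if root.name == leaf_name:
        return [root]
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub:
            return [root] + sub
    return []


def classify_pair(root: Node, a: str, b: str) -> str:
    """Classify two genes by the event at their last common ancestor node."""
    path_a, path_b = path_to(root, a), path_to(root, b)
    if a == b or not path_a or not path_b:
        raise ValueError("need two different genes that are both in the tree")
    lca = None
    for node_a, node_b in zip(path_a, path_b):
        if node_a is not node_b:
            break
        lca = node_a  # deepest node shared by both paths so far
    return "orthologs" if lca.event == "speciation" else "in-paralogs"


# Toy example: the root is the speciation at the LCA; gene A duplicated afterwards.
tree = Node("n0", "speciation", [
    Node("n1", "duplication", [Node("A1"), Node("A2")]),
    Node("B1"),
])
print(classify_pair(tree, "A1", "A2"))
print(classify_pair(tree, "A1", "B1"))
```

In this toy tree, A1 and A2 diverged by a duplication after the LCA, while each of them diverged from B1 by a speciation, so A1 and A2 are co-orthologs relative to B1.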

Typically, when we construct new phylogenomic datasets, we use similarity searches such as BLAST, DIAMOND and HMMER (sometimes in combination with the Markov Cluster algorithm, MCL) to generate sets of genes.

This is an extremely practical approach, but it can be fairly rough. Genes that are truly orthologous to the genes found with BLAST may be missed if similarity searches are too stringent. On the other hand, genes that are NOT true orthologs (i.e. their divergence from the genes found with BLAST predates the LCA) may be falsely included if similarity searches are too loose. Such false positives are typically out-paralogs, i.e. they diverged by a duplication in an ancestor that predates the LCA, or out-xenologs *, i.e. they were introduced into the species tree via horizontal gene transfer from some external donor and they diverged from the other genes in a common ancestor that predates the LCA.

To minimize the number of missing true orthologs, you can use permissive E-values in your BLAST / DIAMOND searches, or you can use more sensitive searching algorithms such as PSI-BLAST or HMMER.

To minimize the number of out-paralogs and out-xenologs (non-orthologs), it may be useful to set alignment overlap criteria in your similarity searches. For example, if a target sequence aligns well, but only covers about 10% of the query sequence, it could constitute a non-ortholog. With BLAST / PSI-BLAST, I like to use the qcovhsp metric, which states what percentage of the query was covered by the alignment. HMMER unfortunately does not report such metrics out of the box, but my script extend_hmmer_domtblout.py will add this.
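
As a rough illustration of such a coverage filter, the sketch below parses BLAST tabular output and keeps only hits that pass both an E-value and a query-coverage cut-off. It assumes the search was run with `-outfmt "6 qseqid sseqid evalue bitscore qcovhsp"`; the column order, the function name and the thresholds are assumptions chosen for illustration, not recommended values.

```python
import csv
import io

# Assumed column order, matching: -outfmt "6 qseqid sseqid evalue bitscore qcovhsp"
COLUMNS = ["qseqid", "sseqid", "evalue", "bitscore", "qcovhsp"]


def filter_hits(handle, max_evalue=1e-5, min_qcovhsp=50.0):
    """Yield hits (as dicts) that pass both the E-value and coverage cut-offs."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        good_evalue = float(row["evalue"]) <= max_evalue
        good_coverage = float(row["qcovhsp"]) >= min_qcovhsp
        if good_evalue and good_coverage:
            yield row


# Made-up example lines, only to show the behaviour.
example = io.StringIO(
    "query1\ttargetA\t1e-30\t250\t95\n"
    "query1\ttargetB\t1e-12\t80\t10\n"  # good E-value, but covers only 10% of the query
    "query1\ttargetC\t0.5\t30\t90\n"    # good coverage, but weak E-value
)
kept = [hit["sseqid"] for hit in filter_hits(example)]
print(kept)
```

With a real search you would pass an open file handle instead of the `io.StringIO` example. Keeping the two cut-offs as separate parameters makes it easy to relax the E-value (to recover divergent orthologs) while still requiring a reasonable overlap.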

Curation

By far the best way to detect non-orthologs, though, is to infer preliminary single gene trees. The information that can be extracted from a visualized tree is absolutely crucial. Unfortunately, going over several dozen single gene trees by hand and inspecting them by eye is an extremely mentally draining process. Yet, it is extremely important. There are several examples of published but wrong trees where the source of the error lay in the inclusion of numerous non-orthologs. I have also seen this first hand in datasets of my own! To reduce the mental pain, and perhaps also increase the reproducibility of this process, I've formulated a couple of guidelines.

First of all, it is important to realize that we cannot be absolutely certain in our recognition of non-orthologs. We are trying to estimate relationships between genes that diverged hundreds of millions of years ago, if not more. But single gene trees can give us strong hints that help single out suspected non-orthologs.

Identifying out-paralogs

Look out for genes or a clade of genes that …

branch with a known out-paralog with strong support. This is one of the tricks that PhyloFisher uses. It purposefully keeps known out-paralogs in its reference dataset to bait out-paralogs from new taxa.

encompass all or a large chunk of the taxonomic diversity of the species tree. Note that even a clade with few taxa can encompass a large chunk of diversity, for example if it includes both Beta- and Alphaproteobacteria.

are situated on a very long branch that is highly supported. The long branch suggests that the divergence of these genes from all other genes occurred before the emergence of the LCA.

show a topology descending from this branch that is oddly similar to the expected species tree.

when used as an outgroup to root the gene tree, leave an ingroup that is also reminiscent of the expected species tree. Since out-paralogs diverged from all other genes before the emergence of the LCA, they are the natural outgroup to root the gene tree with.

If all or many of these conditions are met, we can be quite certain that we are dealing with a clade of out-paralogs.
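
One way to make this checklist a bit more reproducible is to record, per candidate clade, which of the conditions are met and to count them. The sketch below is only an illustration: the `CladeEvidence` fields, the thresholds and the simple count are all assumptions, not an established scoring scheme, and the final call still requires looking at the tree.

```python
from dataclasses import dataclass


@dataclass
class CladeEvidence:
    """Observations for one candidate clade (all fields are hypothetical)."""
    branches_with_known_outparalog: bool  # strongly supported sister of a known out-paralog
    groups_in_clade: int                  # taxonomic groups represented in the clade
    groups_in_dataset: int                # taxonomic groups in the whole dataset
    stem_length: float                    # length of the branch leading to the clade
    median_branch_length: float           # median branch length in the gene tree
    stem_support: float                   # support of the stem branch, in percent
    resembles_species_tree: bool          # topology inside the clade looks like the species tree
    ingroup_resembles_species_tree: bool  # same for the ingroup after rooting on the clade


def count_conditions(ev, min_diversity=0.5, long_branch_factor=5.0, min_support=90.0):
    """Count how many of the five conditions hold (thresholds are arbitrary)."""
    diverse = ev.groups_in_clade / ev.groups_in_dataset >= min_diversity
    long_supported_stem = (
        ev.stem_length >= long_branch_factor * ev.median_branch_length
        and ev.stem_support >= min_support
    )
    conditions = [
        ev.branches_with_known_outparalog,
        diverse,
        long_supported_stem,
        ev.resembles_species_tree,
        ev.ingroup_resembles_species_tree,
    ]
    return sum(conditions)


# Made-up observations for a single clade.
evidence = CladeEvidence(
    branches_with_known_outparalog=False,
    groups_in_clade=4,
    groups_in_dataset=6,
    stem_length=1.8,
    median_branch_length=0.1,
    stem_support=98.0,
    resembles_species_tree=True,
    ingroup_resembles_species_tree=True,
)
print(count_conditions(evidence), "of 5 conditions met")
```

Writing the observations down in this form mainly helps to keep decisions consistent across dozens of gene trees; the count itself should not be treated as a test.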

Some other patterns to look out for, but that are not as strong as the ones above:

the clade of genes occurs with a similar taxonomic composition and branching pattern in another gene tree, where it was identified as a strongly suspected out-paralogous clade. If the two genes are functionally related (part of the same complex or metabolic pathway), we can be fairly certain that the clade in the inspected gene tree is also out-paralogous.

the gene or clade of genes has a domain composition that is distinct from all other homologs in the gene tree. An alternative domain composition may indicate a divergence before the LCA, but it could also indicate accelerated evolution or some novel innovation in a true ortholog. So, be careful when you see this. The script visualize_domains.py can help you with this.

You could also zoom out your taxonomic scope: pull in homologs from a larger diversity of taxa and re-infer your gene tree. For example, if you are checking the Alphaproteobacteria, you could pull in homologs from all Proteobacteria. If the clade of genes that you suspect of being out-paralogs branches far away from all other sequences of your original set of genes, this may be a sign that they are indeed out-paralogs.

Identifying xenologs

Look out for genes or a clade of genes that …

Branch with strong support with taxa that, according to the expected species tree, they should not branch closely with. This may be an in-xenolog, i.e. a gene that was horizontally transferred from a donor that was closely related to the taxa that it branches with.

It can be quite tricky and mentally draining to look for these cases by eye. To aid with this, I am currently developing a script that compares the topology of your gene tree with a reference expected species tree and highlights incongruent tree nodes. It also roots the gene tree automatically, gives distinct colors to taxonomic clades and gives an overview of the untrimmed sequence alignment. It is available here: visualize_tree_incongruencies.py
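
The core idea of such a comparison can be sketched as follows: a clade in the gene tree is flagged when the taxa it contains do not form a clade in the expected species tree. This is a simplified illustration and not the script mentioned above. It assumes that trees are written as nested tuples of taxon names, that each taxon contributes at most one gene, and that both trees are rooted; all function and taxon names are made up.

```python
def clades(tree):
    """Return (leaves under this node, set of all clades at or below it)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def restrict(tree, keep):
    """Prune a tree to the taxa in `keep`, collapsing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def incongruent_clades(gene_tree, species_tree):
    """List gene-tree clades that are not clades of the pruned species tree."""
    gene_leaves, gene_clades = clades(gene_tree)
    pruned = restrict(species_tree, gene_leaves)
    species_clades = clades(pruned)[1] if pruned is not None else set()
    return sorted(sorted(clade) for clade in gene_clades - species_clades)


# Expected species tree and a gene tree in which the Beta2 gene sits inside the Alphas.
species_tree = ((("Alpha1", "Alpha2"), ("Beta1", "Beta2")), "Gamma1")
gene_tree = ((("Alpha1", ("Alpha2", "Beta2")), "Beta1"), "Gamma1")

for clade in incongruent_clades(gene_tree, species_tree):
    print(clade)
```

In this made-up example the two flagged clades are exactly the ones that contain the misplaced Beta2 gene. Note that an incongruent node only marks a place to look at more closely; it does not by itself distinguish a horizontal transfer from a phylogenetic artefact or a poorly resolved node.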

Be on the lookout for phylogenetic artefacts though. A gene that is in fact a regular ortholog may branch with strong support with an unrelated taxon, for example because the two sequences have similar (e.g. amino acid) compositions, or because they have both (independently) undergone accelerated rates of evolution. Phylogenetic artefacts can also occur in single gene trees. We do NOT want to remove such genes. Alleviation of artefacts should be done after the phylogenomic dataset has been curated.

Are situated on a long, well-supported branch that, if used for rooting the gene tree, yields an ingroup with a species-tree-like topology. This may indicate genes that were introduced into these taxa via horizontal gene transfer from a donor outside the species tree, i.e. out-xenologs.

This pattern is pretty much identical to that of out-paralogs (see above). In either case, you would want to remove these genes from the phylogenomic dataset.

* NOTE: I made up the terms 'in-xenolog' and 'out-xenolog', but I think they make sense.
