Construction and Curation of phylogenomic datasets
Phylogenomic analyses use genomic data to answer phylogenetic questions. Often we're asking about the shape of a species tree. How did modern-day taxa diverge over their evolutionary history? What is the deepest divergence (i.e. the root) of these taxa?
Construction
To investigate these questions, we often would like to construct a so-called phylogenomic dataset. In essence, it is a collection of genes that are present in these taxa and that are informative about this species tree. For example, Munoz-Gomez et al. 2022 used a dataset of 108 genes present in 115 Proteobacteria and mitochondria to investigate the placement of mitochondria relative to Alphaproteobacteria.
For a set of genes to be informative about the species tree in question, their evolutionary history should ideally match that of the species tree. These genes should therefore be descendants of some ancestral gene that was present either at the root of the species tree or at one of its descendant nodes, and that evolved purely vertically. Note that if the ancestral gene was newly introduced to this clade via horizontal gene transfer, it may still be informative, as long as it evolved purely vertically after that. On a special note, some methods to reconstruct species trees do allow for horizontal gene transfers (e.g. PHYLDOG) and can thus use a lot more genes, but such methods are currently still computationally intractable.
Technically speaking, our set of genes should comprise an orthogroup, defined by Toni Gabaldon & Eugene Koonin, 2013 as (paraphrasing here) the set of present-day genes that have descended from an ancestral gene that was present at the last common ancestor of all species in question or one of its descendant nodes. Within an orthogroup, each possible pair of genes falls into one of these categories:
- a pair of orthologs (i.e., their last common ancestor was a speciation event)
- a pair of in-paralogs (i.e., their last common ancestor was a duplication event)
- a pair of in-paralogs is in turn a pair of co-orthologs relative to a third gene, if the last common ancestor of all three was a speciation event
- a pair of in-xenologs *, if one of the pair has undergone horizontal gene transfer at some point in its evolutionary history since its divergence from the other, and the pair's common ancestor gene was present in the LCA of all species in question or one of its descendants
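The pairwise categories above can be written down as a small decision rule. This is only a minimal sketch in Python: the function names, argument names and string labels are made up for illustration, and it assumes you already know (e.g. from a reconciled gene tree) which event sits at the last common ancestor of the genes and whether a transfer happened since.

```python
def classify_pair(lca_event, transfer_since_divergence=False):
    """Classify a pair of genes that belong to the same orthogroup.

    lca_event: 'speciation' or 'duplication', the event at the pair's
        last common ancestor (hypothetical labels, chosen for this sketch).
    transfer_since_divergence: True if one of the two genes underwent
        horizontal gene transfer after the pair diverged.
    """
    if lca_event not in ("speciation", "duplication"):
        raise ValueError(f"unknown event: {lca_event!r}")
    if transfer_since_divergence:
        return "in-xenolog"
    if lca_event == "speciation":
        return "ortholog"
    return "in-paralog"


def are_co_orthologs(pair_event, event_with_third_gene):
    """Two in-paralogs are co-orthologs relative to a third gene if the
    last common ancestor of all three genes was a speciation event."""
    return pair_event == "duplication" and event_with_third_gene == "speciation"
```

For example, classify_pair("duplication") gives "in-paralog", and that same pair counts as co-orthologs of a third gene only if are_co_orthologs("duplication", "speciation") holds.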
Typically, when we construct new phylogenomic datasets, we use similarity searches such as BLAST, DIAMOND and HMMER to generate sets of genes.
This is an extremely practical approach, but it can be fairly rough. Genes that are truly orthologs relative to genes that were found with BLAST may be missed if similarity searches are too stringent. On the other hand, genes that are NOT true orthologs (i.e. their divergence from the genes found with BLAST predates the last common ancestor of the species in question) may be falsely included if similarity searches are too loose. Such false positives are typically out-paralogs, i.e. they diverged by a duplication in an ancestor that predates the last common ancestor of the group, or out-xenologs *, i.e. they were introduced into the species tree via horizontal gene transfer from some external donor and they diverged from the other genes in a common ancestor that predates the last common ancestor of the group.
To minimize the number of missing true orthologs, you can use permissive E-values in your BLAST / DIAMOND searches, or you can use more sensitive searching algorithms such as PSI-BLAST or HMMER.
To minimize the number of out-paralogs and out-xenologs (non-orthologs), it may be useful to set alignment overlap criteria in your similarity searches. For example, if a target sequence aligns well, but only covers about 10% of the query sequence, it could constitute a non-ortholog. With BLAST / PSI-BLAST, I like to use the qcovhsp metric, which states what percentage of the query was covered by the alignment. HMMER unfortunately does not report such metrics out of the box, but my script extend_hmmer_domtblout.py will add this.
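As a rough illustration of such an overlap criterion, here is a minimal Python sketch. It assumes BLAST tabular output produced with -outfmt "6 qseqid sseqid evalue bitscore qcovhsp" (that column order is my choice for the example), and the 50% cutoff is an arbitrary placeholder, not a recommendation. The second function computes a comparable coverage value from one line of hmmsearch --domtblout output, using the profile length and the "hmm from" / "hmm to" columns. It is NOT extend_hmmer_domtblout.py, just a guess at the kind of metric one might add.

```python
def filter_blast_hits(lines, min_qcovhsp=50.0):
    """Keep hits whose HSP covers at least min_qcovhsp percent of the query.

    Assumes tab-separated columns: qseqid sseqid evalue bitscore qcovhsp.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, evalue, bitscore, qcovhsp = line.split("\t")[:5]
        if float(qcovhsp) >= min_qcovhsp:
            kept.append((qseqid, sseqid, float(evalue), float(bitscore), float(qcovhsp)))
    return kept


def hmmer_profile_coverage(domtblout_line):
    """Percent of the query profile covered by one domain hit.

    Uses whitespace-separated domtblout fields (0-based):
    5 = qlen, 15 = hmm from, 16 = hmm to.
    """
    fields = domtblout_line.split()
    qlen = int(fields[5])
    hmm_from, hmm_to = int(fields[15]), int(fields[16])
    return 100.0 * (hmm_to - hmm_from + 1) / qlen
```

Usage would be something like filter_blast_hits(open("hits.tsv")), where hits.tsv is a hypothetical file name. Whatever cutoff you pick, it is worth checking it against a few gene families you know well before applying it to the whole dataset.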
Curation
By far the best way to detect non-orthologs, though, is to infer preliminary single-gene trees. The information that can be extracted from a visual tree is absolutely crucial. Unfortunately, going over several dozen single-gene trees by hand and inspecting them by eye is an extremely mentally draining process. Yet it is extremely important. There are several examples of published but wrong trees where the source of the error lay in the inclusion of numerous non-orthologs. I have also seen this firsthand in datasets of my own! To reduce the mental pain and perhaps also increase the reproducibility of this process, I've formulated a couple of guidelines.
* NOTE: I made up the terms 'in-xenolog' and 'out-xenolog', but I think they make sense.
