curation_of_phylogenomic_datasets



Phylogenomic analyses attempt to use genomic data to answer phylogenetic questions. Often we're asking about the shape of a species tree. How did modern-day taxa diverge over their evolutionary history? What is the deepest divergence (i.e. the root) of these taxa?

To investigate these questions, we often would like to construct a so-called phylogenomic dataset. In essence, it is a collection of genes that are present in these taxa and that are informative about this species tree. For example, Muñoz-Gómez et al. 2022 used a dataset of 108 genes present in 115 Proteobacteria and mitochondria to investigate the placement of mitochondria relative to Alphaproteobacteria.
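To make the idea of "genes present in taxa" concrete, the following is a minimal sketch of how such a dataset could be summarised as a presence table. The taxon and gene names, the `gene_occupancy` helper, and the `min_occupancy` threshold are all hypothetical and chosen only for illustration; they are not taken from any published dataset.

```python
from typing import Dict, Iterable, Set


def gene_occupancy(presence: Dict[str, Set[str]], taxa: Iterable[str]) -> Dict[str, float]:
    """Return, for each gene, the fraction of the given taxa in which it is present."""
    taxa_set = set(taxa)
    return {
        gene: len(found_in & taxa_set) / len(taxa_set)
        for gene, found_in in presence.items()
    }


# Hypothetical toy dataset: which taxa carry which marker gene.
taxa = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]
presence = {
    "gene_1": {"taxon_A", "taxon_B", "taxon_C", "taxon_D"},
    "gene_2": {"taxon_A", "taxon_C"},
}

occ = gene_occupancy(presence, taxa)

# Assumed, arbitrary cut-off: keep genes found in at least 75% of the taxa.
min_occupancy = 0.75
kept = sorted(gene for gene, fraction in occ.items() if fraction >= min_occupancy)
print(kept)
```

Presence alone does not make a gene informative about the species tree; it only tells you which genes are candidates worth inspecting further.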

For a set of genes to be informative about the species tree in question, their evolutionary history should ideally match that of the species tree. These genes should therefore be descendants of some ancestral gene that was present either at the root of the species tree or at one of its descendant nodes, and that evolved purely vertically. Note that if the ancestral gene was newly introduced to this clade via horizontal gene transfer, it may still be informative, as long as it evolved purely vertically after that. As an aside, some methods to reconstruct species trees do allow for horizontal gene transfers (e.g. PHYLDOG) and can thus use many more genes, but such methods are currently still computationally intractable.
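As a rough illustration of what "matching the species tree" means, the sketch below compares the rooted clades of a single-copy gene tree with those of a species tree. Trees are written as nested tuples with leaves named after taxa; this representation and the helper names `clades` and `same_topology` are assumptions made for the example. A mismatch only shows that the two topologies differ. It does not by itself prove horizontal gene transfer, since tree reconstruction error and other processes can also produce incongruence.

```python
def clades(tree):
    """Return (leaves, clade_set) for a rooted tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        # A single leaf: no internal clades below it.
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    # Every internal node contributes the set of leaves below it.
    found.add(leaves)
    return leaves, found


def same_topology(gene_tree, species_tree):
    """True if both rooted trees contain exactly the same clades."""
    return clades(gene_tree)[1] == clades(species_tree)[1]


# Hypothetical four-taxon example.
species_tree = (("A", "B"), ("C", "D"))
vertical_gene = (("B", "A"), ("D", "C"))    # same clades, child order differs
discordant_gene = (("A", "C"), ("B", "D"))  # groups A with C instead of B

print(same_topology(vertical_gene, species_tree))
print(same_topology(discordant_gene, species_tree))
```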

Technically speaking, our set of genes should comprise an orthogroup. Gabaldón & Koonin (2013) define an orthogroup as (paraphrasing here) the set of present-day genes that have descended from an ancestral gene that was present at the last common ancestor (LCA) of all species in question or at one of its descendant nodes. Within an orthogroup, each possible pair of genes is either a pair of orthologs (i.e., their last common ancestor was a speciation event) or a pair of in-paralogs (i.e., their last common ancestor was a duplication event). If one gene of the pair has undergone horizontal gene transfer at some point since its divergence from the other, and the pair's common ancestral gene was present in the LCA of all species in question or in one of its descendants, the two genes constitute a pair of in-xenologs. *

* NOTE: I made the term 'in-xenolog' up, but I think it makes sense.
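The pairwise definitions above can be sketched in code, assuming a gene tree whose internal nodes have already been labelled as speciation or duplication events and whose branches carry a flag for horizontal transfer. The `Node` class, its fields, and the gene names are hypothetical; in practice such labels would come from reconciling the gene tree with a species tree, which this sketch does not do.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # compare and hash nodes by identity
class Node:
    name: str
    parent: Optional["Node"] = None
    event: Optional[str] = None   # "speciation" or "duplication" (internal nodes)
    transferred: bool = False     # True if an HGT happened on the branch above this node


def path_to_root(node: Node) -> List[Node]:
    """Return the node followed by all of its ancestors up to the root."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    return path


def classify_pair(a: Node, b: Node) -> str:
    """Classify two genes of one orthogroup by their last common ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_a = set(path_a)
    lca = next(n for n in path_b if n in ancestors_a)
    # Branches separating the two genes: every node strictly below the LCA.
    below = [n for n in path_a[:path_a.index(lca)]] + [n for n in path_b[:path_b.index(lca)]]
    if any(n.transferred for n in below):
        return "in-xenologs"
    return "orthologs" if lca.event == "speciation" else "in-paralogs"


# Hypothetical labelled gene tree for an orthogroup spanning three species.
root = Node("root", event="speciation")
dup = Node("dup", parent=root, event="duplication")
sp23 = Node("sp23", parent=root, event="speciation")
sp1_a = Node("species1_copyA", parent=dup)
sp1_b = Node("species1_copyB", parent=dup)
sp2 = Node("species2_gene", parent=sp23)
sp3 = Node("species3_gene", parent=sp23, transferred=True)

print(classify_pair(sp1_a, sp1_b))  # LCA is the duplication node
print(classify_pair(sp1_a, sp2))    # LCA is the root speciation
print(classify_pair(sp2, sp3))      # a transfer lies on the branch to species3_gene
```

The transfer check is done before looking at the LCA event, so any pair separated by a transferred branch is reported as in-xenologs regardless of whether the LCA was a speciation or a duplication.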

After collecting your marker genes (or rather, their encoded protein sequences) of all taxa for your phylogenomic dataset, you need to curate your dataset.
