curation_of_phylogenomic_datasets

Differences

This shows you the differences between two versions of the page.

curation_of_phylogenomic_datasets [2023/09/19 11:22] – [Curation] 134.190.232.90
curation_of_phylogenomic_datasets [2025/03/06 11:50] (current) 134.190.145.228
Line 1: Line 1:
-====== Construction and Curation of phylogenomic datasets ======
+Joran Martijn
 + 
 +====== Curation of phylogenomic datasets ======
  
 Phylogenomic analyses attempt to use genomic data to answer phylogenetic questions. Often we're asking about the shape of a species tree. How did modern-day taxa diverge over their evolutionary history? What is the deepest divergence (i.e. the root) of these taxa?
Line 15: Line 17:
   * if one of the pair has undergone horizontal gene transfer at some point in its evolutionary history since its divergence from the other of the pair, and the pair's common ancestor gene was present in the LCA or one of its descendants, it constitutes an **in-xenolog**. *
  
-Typically when we construct new phylogenomic datasets, we use similarity searches such as BLAST and DIAMOND and HMMER to generate sets of genes.
+Typically when we construct new phylogenomic datasets, we use similarity searches such as BLAST and DIAMOND and HMMER (sometimes in combination with Markov Clustering, or MCL, algorithms) to generate sets of genes.
  
 This is an extremely practical approach, but can be fairly rough. Genes that are truly orthologs relative to genes that were found with BLAST may be missed if similarity searches are too stringent. On the other hand, genes that are NOT true orthologs (i.e. their divergence with the genes found with BLAST //predates// the LCA) may be falsely included if similarity searches are too loose. Such false positives are typically **out-paralogs**, i.e. they diverged by a duplication in an ancestor that //predates// the LCA, or **out-xenologs** *, i.e. they were introduced into the species tree via horizontal gene transfer from some external donor and they diverged from the other genes in a common ancestor that //predates// the LCA.
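To make the stringency trade-off concrete, here is a minimal sketch that filters tabular BLAST hits (the standard 12-column ''-outfmt 6'' layout) with two different E-value cutoffs. The hit table, the gene names, and both cutoffs are made up for illustration; they are not recommended settings.

```python
# Hypothetical BLAST hits in tabular format (-outfmt 6). Columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
HITS = (
    "query1\ttaxonA_gene1\t78.5\t310\t60\t2\t1\t310\t5\t314\t1e-120\t420\n"
    "query1\ttaxonB_gene1\t55.2\t305\t130\t4\t3\t307\t8\t312\t3e-45\t160\n"
    "query1\ttaxonC_gene7\t31.0\t250\t160\t9\t20\t269\t40\t289\t2e-08\t52.4\n"
    "query1\ttaxonD_gene3\t27.4\t120\t80\t5\t100\t219\t10\t129\t0.004\t35.1\n"
)


def select_hits(blast_tabular, max_evalue):
    """Return the subject IDs of all hits with an E-value <= max_evalue."""
    selected = []
    for line in blast_tabular.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            selected.append(subject)
    return selected


# A stringent cutoff keeps few genes and may miss true orthologs;
# a loose cutoff keeps more genes and may pull in out-paralogs or out-xenologs.
stringent = select_hits(HITS, max_evalue=1e-50)
loose = select_hits(HITS, max_evalue=1e-3)
print(stringent)
print(loose)
```

Neither cutoff tells you which hits are orthologs. That is what the gene tree inspection is for.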
Line 29: Line 31:
 First of all, it is important to realize that we cannot be absolutely certain in our recognition of non-orthologs. We are trying to estimate relationships between genes that diverged hundreds of millions of years ago, if not more. But single gene trees can give us strong hints for singling out suspected non-orthologs.
  
-To identify **out-paralogs**, look out for genes or a clade of genes that ...
+==== Identifying out-paralogs ====
  
-  * **branches with a known out-paralog** with strong support. This is one of the tricks that [[https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001365|PhyloFisher]] uses. It purposefully keeps known out-paralogs in its reference dataset to bait non-orthologs from new taxa.
+Look out for genes or a clade of genes that ...
 + 
 +  * **branch with a known out-paralog** with strong support. This is one of the tricks that [[https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001365|PhyloFisher]] uses. It purposefully keeps known out-paralogs in its reference dataset to bait out-paralogs from new taxa.
  
   * **encompass all or a large chunk of the taxonomic diversity of the species tree**. Note that even with few taxa, we can encompass a large chunk of diversity. For example, if it includes a bunch of Beta- and Alpha-proteobacteria
Line 47: Line 51:
   * the clade of genes occurred with a similar taxonomic composition and branching pattern as in another gene tree, where it was identified as a strongly suspected out-paralogous clade. If this gene is somehow functionally related to the gene with the strongly suspected out-paralogous clade (part of the same complex or metabolic pathway), we can be fairly certain the clade in the inspected gene is also out-paralogous.
  
-  * the gene or clade of genes have a domain composition that is distinct from all other homologs in the gene tree. An alternate domain composition may indicate a divergence before the LCA, but it could also indicate some accelerated evolution or some novel innovation of a true ortholog. So, be careful when you see this. This script [[https://github.com/RogerLab/gospel_of_andrew/blob/main/visualize_domains.py|visualize_domains.py]] can help you with this
+  * the gene or clade of genes have a **domain composition** that is distinct from all other homologs in the gene tree. An alternate domain composition may indicate a divergence before the LCA, but it could also indicate some accelerated evolution or some novel innovation of a true ortholog. So, be careful when you see this. This script [[https://github.com/RogerLab/gospel_of_andrew/blob/main/visualize_domains.py|visualize_domains.py]] can help you with this
 + 
 +  * You could **zoom out your taxonomic scope**. Pull in homologs from a larger diversity of taxa and re-infer your gene tree. For example, if you are checking the Alphaproteobacteria, you could pull in homologs from all Proteobacteria. If the clade of genes that you suspect of being out-paralogs branches far away from all other sequences in your original set of genes, this may be a sign that they are indeed out-paralogs.
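One of the criteria above, whether a clade encompasses a large chunk of the taxonomic diversity of the species tree, can be roughly quantified. The sketch below counts which fraction of the major taxonomic groups is represented in a suspect clade. The taxon set, the choice of class-level groups, and the function name ''group_coverage'' are all assumptions made for illustration.

```python
# Hypothetical mapping of the taxa in a dataset to major taxonomic groups.
TAXON_GROUP = {
    "Rickettsia": "Alphaproteobacteria",
    "Caulobacter": "Alphaproteobacteria",
    "Burkholderia": "Betaproteobacteria",
    "Neisseria": "Betaproteobacteria",
    "Escherichia": "Gammaproteobacteria",
    "Vibrio": "Gammaproteobacteria",
}


def group_coverage(clade_taxa, taxon_group):
    """Fraction of all major groups that have at least one member in the clade."""
    all_groups = set(taxon_group.values())
    clade_groups = {taxon_group[taxon] for taxon in clade_taxa}
    return len(clade_groups) / len(all_groups)


# A suspect clade with only two taxa already spans two of the three groups.
coverage = group_coverage(["Rickettsia", "Burkholderia"], TAXON_GROUP)
print(f"{coverage:.2f}")
```

A high value on its own proves nothing. It is one hint to weigh together with the other patterns in this list.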
 + 
 +==== Identifying xenologs ==== 
 + 
 +Look out for genes or a clade of genes that ... 
 + 
 +  * Branch with strong support with taxa that, according to the expected species tree, they should not branch closely with. This may be an **in-xenolog**, i.e. a gene that was horizontally transferred from a donor that was closely related to the taxa that it branches with.
 +  
 +It can be quite tricky and mentally draining to look for these cases by eye. To aid with this, I am currently developing a script that compares the topology of your gene tree with a reference (expected) species tree and highlights incongruent tree nodes. It also roots the gene tree automatically, gives distinct colors to taxonomic clades and gives an overview of the untrimmed sequence alignment. It is available here: [[https://github.com/RogerLab/gospel_of_andrew/blob/main/visualize_tree_incongruencies.py|visualize_tree_incongruencies.py]]
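To illustrate the general idea of such a topology comparison (this is not the linked script, only a minimal sketch), the example below flags clades in a rooted gene tree whose set of taxa does not form a clade in the expected species tree. Several things are assumptions: trees are written as nested tuples, gene tree leaves are named ''Taxon|gene_id'', the gene tree is already rooted, and branch support is ignored entirely.

```python
def leaf_sets(tree, label=lambda leaf: leaf):
    """Return the set of leaf labels under every node of a nested-tuple tree.

    Leaves are included as singleton sets; the set for the node itself is last.
    """
    if isinstance(tree, str):
        return [frozenset([label(tree)])]
    collected = []
    below = set()
    for child in tree:
        child_sets = leaf_sets(child, label)
        collected.extend(child_sets)
        below |= child_sets[-1]  # last entry is the child's own leaf set
    collected.append(frozenset(below))
    return collected


def incongruent_clades(gene_tree, species_tree):
    """List gene tree clades whose taxa do not form a clade in the species tree."""
    gene_clades = leaf_sets(gene_tree, label=lambda leaf: leaf.split("|")[0])
    present = gene_clades[-1]  # all taxa that occur in the gene tree
    # Restrict species tree clades to taxa present in the gene tree, so that
    # missing taxa alone do not cause a clade to be flagged.
    expected = {clade & present for clade in leaf_sets(species_tree)}
    return [c for c in gene_clades if len(c) > 1 and c not in expected]


# Hypothetical expected species tree and a gene tree in which the Beta1 gene
# branches inside the Alpha taxa (the in-xenolog pattern).
species_tree = ((("Alpha1", "Alpha2"), ("Beta1", "Beta2")), "Outgroup")
gene_tree = (((("Alpha1|g1", "Beta1|g2"), "Alpha2|g1"), "Beta2|g1"), "Outgroup|g1")

flagged = incongruent_clades(gene_tree, species_tree)
for clade in flagged:
    print(sorted(clade))
```

In practice you would only act on flagged clades that also have strong branch support, which this sketch does not look at.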
 + 
 +Be on the lookout for phylogenetic artefacts though. A gene that is in fact a regular ortholog may branch with strong support with an unrelated taxon, for example because they have similar sequence compositions, or because they have both (independently) undergone accelerated rates of evolution. Phylogenetic artefacts can also occur in single gene trees. We do NOT want to remove such genes. Alleviation of artefacts should be done after the phylogenomics dataset has been curated.
  
-  * You could zoom out your taxonomic scope, and you pull in homologs from a larger diversity of taxa. For example, if you are checking the Alphaproteobacteria, you decide to pull in homologs from all Proteobacteria. If your clade of genes that you suspect of being out-paralogs branches far away from all other sequences of your original set of genes, this may be a sign that they are indeed out-paralogs.
+  * Are situated on a long, well supported branch that, if used for rooting the gene tree, yields an ingroup with a species tree like topology. This may indicate genes that were introduced into these taxa via horizontal gene transfer from a donor //outside// the species tree, i.e. **out-xenologs**.
 +  
 +This pattern is pretty much identical to that of **out-paralogs** (see above). In either case, you would want to remove these genes from the phylogenomics dataset.
  
 * NOTE: I made the terms 'in-xenolog' and 'out-xenolog' up, but I think they make sense.
    
  