This is an old revision of the document!
Table of Contents
Gene prediction with find_supported_orfs.py
By Joran Martijn (Augustus 2025)
Ab initio gene prediction tools use only the DNA sequence as a source of information. This makes it very practical, but most likely also fairly inaccurate.
On the other hand, pipeline tools such as BRAKER and Funannotate use ab initio tools in conjunction with RNAseq data and optionally also protein homology data. This generally leads to more accurate gene models.
However, in our experience, the pipeline tools often predict introns in locations where RNAseq data strongly suggests there should not be any introns. I'm not entirely sure why this happens, but I suspect that it is due to how the RNAseq data is used. In these tools, the RNAseq data is used to train or update the Hidden Markov Models (HMMs), and then it is these updated HMMs that are used to predict genes in a manner similar to that of ab initio tools. Hence, after these HMMs are updated, exact intron locations available in the RNAseq data, are 'forgotten' when the actual gene prediction occurs.
This prompted me to develop a more straightforward python script that directly uses intron locations available in the RNAseq data as it predicts genes. The flipside is that it won't be able to predict genes in areas of the genome that are not expressed, and thus not available in the RNAseq data. I callec this script find_supported_orfs.py. It's pretty dumb, in the sense that it doesn't make use of fancy Hidden Markov Models or other types of sophisticated statistical tricks.
find_supported_orfs.py in a nutshell
1. Load BAM file representing the RNAseq data aligned to the genome sequence
2. Load FASTA file representing the genome sequence
3. Interrogate the RNAseq BAM file and extract the locations of all supported introns
An intron is considered supported when…
- Have at least 5 reads supporting the existence of this intron
- Must be within a certain length range (set by the user)
- Must have specific intron boundaries (e.g. GT-AG, or AT-AG; set by the user)
- At least 40% of the reads mapping here must support this intron
4. Interrogate the RNAseq BAM file to identify all regions that have high RNAseq coverage
A brief description of this high coverage region finding algorithm:
- Iterates over each position in the genome
- Calculate RNAseq coverage for that position (positions supported as introns are also considered 'covered')
- Once a position exceeds a coverage threshold (8 by default), it marks this position as the start of the high coverage region
- Every new position with at least 8 coverage gets added to the high coverage region
- If it encounters a position with less than 8 coverage, it does not immediately mark the end of the high coverage region.
- If the coverage remains low for a set number of positions (i.e. the user-defined 'window size'), it will then mark the last known high coverage position as the end of the high coverage region.
- If the coverage returns to high levels after a brief dip, it will keep extending the high coverage region.
- Finally, 25 bps are added to the beginning and the end of the high coverage region. Thus extending the region by 50 bp in total. This “buffer” makes the gene finding a bit more sensitive.
Currently, a high coverage region must be at least 150 bp in length to move on to the next stage
5. High coverage regions are matched with supported introns and spliced accordingly
If any introns in this region are found to be immediately adjacent to each other, or overlapping, the set of introns is reduced to a non-overlapping one, where the most supported introns are preferred.
6. Open reading frames are identified in this high coverage region
Each ORF must be 300 - 10000 bp long. If two ORFs overlap (for example two ORFs covering the same gene but with a different start codon), the longest is preferred.
7. Introns are then re-inserted and coordinates are adjusted accordingly
8. The genes are then printed in GFF3 format
Blastocystis
Many Blastocystis lineages contain protein coding genes that do not end with a canonical STOP codon. Instead, a gene can end with 'T', 'TA', or 'TG'. Then, when the polyA tail is added during mRNA maturation, the STOP codon is completed.
Run of the mill gene predictors are unaware of this peculiarity, and will thus miss these genes. As some Blastocystis genomes can have up to 30-40% of such genes, this is a pretty major problem.
find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal 'T', 'TA' or 'TG' is part of a conserved TxxxxTGTTTGTT motif, where 'x' can be any base.
