Gene prediction with find_supported_orfs.py

By Joran Martijn (August 2025)

Ab initio gene prediction tools use only the DNA sequence as a source of information. This makes them very practical, but most likely also fairly inaccurate.

On the other hand, pipeline tools such as BRAKER and Funannotate use ab initio tools in conjunction with RNAseq data and optionally also protein homology data. This generally leads to more accurate gene models.

However, in our experience, the pipeline tools often predict introns at locations where the RNAseq data strongly suggests there are none. I'm not entirely sure why this happens, but I suspect it is due to how the RNAseq data is used. In these tools, the RNAseq data is used to train or update the Hidden Markov Models (HMMs), and it is these updated HMMs that then predict genes in a manner similar to that of ab initio tools. Hence, once the HMMs are updated, the exact intron locations available in the RNAseq data are 'forgotten' when the actual gene prediction occurs.

This prompted me to develop a more straightforward Python script that directly uses the intron locations available in the RNAseq data as it predicts genes. The flip side is that it cannot predict genes in regions of the genome that are not expressed, and that are therefore absent from the RNAseq data.
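
To make the idea of "intron locations available in the RNAseq data" concrete, here is a minimal sketch of how such locations can be read off spliced alignments. It is not taken from find_supported_orfs.py and may differ from what the script actually does. It assumes the reads were mapped with a splice-aware aligner that writes SAM/BAM records, in which the CIGAR operation N marks a region skipped on the reference, the usual encoding of an intron. The function name introns_from_cigar and the example coordinates are hypothetical.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(ref_start, cigar):
    """Return intron intervals for one alignment as (start, end) tuples.

    Coordinates are 0-based and half-open. ref_start is the 0-based
    leftmost reference position of the alignment (SAM POS minus 1).
    This helper is hypothetical and only illustrates the principle.
    """
    introns = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # A skipped reference region: the read spans an intron here.
            introns.append((pos, pos + length))
        if op in REF_CONSUMING:
            pos += length
    return introns


# A read starting at reference position 100 with a 200 bp intron
# between a 50 bp and a 30 bp aligned block.
print(introns_from_cigar(100, "50M200N30M"))  # [(150, 350)]
```

Counting how many reads support each interval would then give a measure of how strongly the RNAseq data backs a given intron, which is the kind of evidence an HMM-based predictor can lose once training is finished.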