This is an old revision of the document!

Improving gene models with fix_genes_with_false_introns.py

Joran Martijn (July 2024)

Automatic gene prediction algorithms/pipelines such as GeneMark, Braker2 and Funannotate can work quite nicely to get a good first estimate of gene models. However, it seems that a non-negligable number of gene models either contain introns that are not supported by the RNAseq data, or do not contain introns that are supported. These cases become obvious to the human eye when inspecting the gene models in context of the RNAseq data in for example IGV, GenomeView or similar tools.

Ignoring true introns or including false introns will lead to errors in gene models. These errors can be fixed manually (see also the other wiki page on curating gene models) but this can be quite tedious and extremely time consuming. In addition, manual edits can also introduce human error.

I've developed a python script, fix_genes_with_false_introns.py, that is available here on the GitHub repository of the lab that attempts to automatically fix gene models that contain intron features that are not supported by the RNAseq data.

Briefly, it iterates over all genes defined in an input GFF3 file, checks for each gene whether all of its introns are actually supported by the RNAseq data (provided as a BAM file). If at least one intron is not supported, the entire gene model is removed. Then, a region is defined, typically (but not always) between the upstream gene and the downstream gene, and for this region all supported intron locations are exctracted from the BAM file and subsequently spliced out of the region. This 'intron-less' sequence is then fed to a simple ORF finder, and the largest ORFs are selected. Introns are re-inserted, and the resulting ORFs are used to generate new gene features for the output GFF3 file.

This is a very straightforward but 'dumb' approach, as it does not make use of sophisticated Markov chain models. In my experience though, most new gene models are much better approximations of the true genes, but occassionally they will still be wrong, and will need to be curated by hand. For example, sometimes the shorter version of an ORF will correspond to the true gene, but the script will select the longer ORF nonetheless.

The main aim of this script is thus NOT to be the final output of your gene prediction pipeline, but rather to shave off a large amount of time from your eventual manual curation of the gene models.

Installation

This script makes use of python packages:

gffutils to read in, handle and output GFF3 files
pysam to read in and handle BAM files
biopython to read in and handle FASTA files
orfipy to predict ORFs

If you are planning to run this on your local computer, perhaps the easiest way to get it to run is to install all dependencies using CONDA or MAMBA:

conda create -n fix_genes_script -c bioconda gffutils pysam biopython orfipy

If you want to run it on PERUN, you'd need to find an environment that has these dependencies installed as well. As of right now, the find-supported-orfs environment works.

Necessary Input Files

You will need

The GFF3 file containing all the gene models you wish to check
The FASTA file containing the DNA sequences of the genome. Ensure that the FASTA headers / Sequence names match exactly those specified in the GFF3 file
The BAM and BAM index file (typically .bam and .bam.bai files) that contain the mapped RNAseq reads. Also here ensure that the reference sequence names match those in the GFF3 and FASTA files. The indexed BAM file is not fed directly to the script, but the script will assume it to exist and to be in the same directory as the BAM file.

Adding intron features to the GFF3

Often gene predictors like GeneMark, Braker2 and Funannotate will not explicitly report introns as distinct GFF3 features. They are instead implied. For example, if a gene feature is associated with at least two exon features, the regions in between the exon features are implied to be intron features.

My script however needs intron features to be explicitly written in the GFF3 file. So, if your file has implied but not explicit intron features, run this script first (also available on the gospel_of_andrew)

add_intron_features.py -g <your_GFF3_file> > <output_GFF3_file>

NOTE: the add_intron_features.py script may need to be adjusted. Check what version of gffutils you have installed in your environment:

conda list | grep gffutils

If it is 0.11 or lower, you don't need to do anything. If it is 0.12 or higher, comment out these lines

sorted_exon_names = sorted( i.attributes['ID'], key=natural_sort_key )

i.attributes['ID'].pop(1)

i.attributes['ID'][0] = sorted_exon_names[0].replace('exon','intron')

The PERUN environment find-supported-orfs should have 0.12, so you'll need to adjust the code.

cgeb2001's DokuWiki!

Table of Contents

Improving gene models with fix_genes_with_false_introns.py

Installation

Necessary Input Files

Adding intron features to the GFF3

Running fix_genes_with_false_introns.py