Gene prediction with just GeneMark

Joran Martijn (January 2023)

GeneMark is one of the oldest gene prediction tools still in development, with papers describing the first algorithms appearing as early as 1986, 1993 (1), 1993 (2) and 1998. The latest update (as of January 2023, GeneMark-EP+) was published in 2022.

GeneMark was originally developed for prokaryotes but has since been extended to work with eukaryotes as well. We are currently on the 4th eukaryotic version. The first two versions (Lukashin and Borodovsky, unpublished; Tarasenko and Borodovsky, unpublished) are cited by the third only as unpublished data. The third version is published here.

GeneMark is maintained and developed by Mark Borodovsky's Bioinformatics Lab at the Georgia Institute of Technology (Georgia Tech) in the US.

The name GeneMark may derive from the Markov models it uses, but it may or may not also derive from Dr. Borodovsky's first name.

Unfortunately, the GeneMark tools are not distributed through conda repositories and can only be downloaded from their website. In addition, you need to download a license file and update it every 6 months in order to keep using GeneMark. This archaic way of doing things is probably a result of GeneMark's age. There are also GitHub pages, but they do not seem to contain all that much at the moment.

# use your browser to download the license key relevant to your system

# unpack the key and install it as .gm_key in your home directory,
# which is where GeneMark looks for it
gunzip gm_key_64.gz
mv gm_key_64 ~/.gm_key
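
Because the key has to be refreshed roughly every 6 months, a quick check before a long run can save a confusing failure later. The sketch below is not part of GeneMark: `check_gm_key` is a hypothetical helper name, it assumes the key sits at `~/.gm_key` unless you pass another path, and it uses the file's modification time only as a rough stand-in for the age of the key.

```shell
# Hypothetical helper, not shipped with GeneMark.
# Usage: check_gm_key [path-to-key] [max-age-in-days]
check_gm_key() {
    key="${1:-$HOME/.gm_key}"      # assumed default location of the key
    max_days="${2:-180}"           # roughly 6 months

    # The key must exist and be non-empty
    if [ ! -s "$key" ]; then
        echo "missing"
        return 1
    fi

    # find prints the path only if it was modified more than max_days ago
    if [ -n "$(find "$key" -mtime +"$max_days")" ]; then
        echo "stale"
        return 2
    fi

    echo "ok"
}
```

Calling `check_gm_key` without arguments checks `~/.gm_key` against a 180-day limit; a result of `stale` is only a hint that it may be time to download a fresh key.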

The set of gene prediction algorithms relevant to eukaryotes is collected in the GeneMark-ES suite, where E stands for Eukaryotic and S for Self-training. The suite contains GeneMark.hmm, GeneMark-ES, GeneMark-ET and GeneMark-EP.

The different algorithms are all called with the Perl script gmes_petap.pl. Run it without any additional arguments to get the help page (note that the standard -h or --help does not work here). You may think, like me, that gmes_petap.pl is quite a peculiar name. You are not wrong, but each of these letters is shorthand for one of the algorithms stashed in this script:

gm   GeneMark.hmm
e    Eukaryote
s    Self-training
p    Plus
e    Evidence
t    Transcripts
a    ..and..
p    Proteins

GeneMark-ES

This is perhaps the most straightforward and pure ab initio gene prediction tool. Only the genome FASTA file is provided, and the algorithm will do its best without any external sources of evidence or training input (hence Self-training), to predict the gene start and end locations, including possible introns.

gmes_petap.pl --sequence <genome.fasta> --ES

If your genome is fungal or fungal-like, you can also invoke the --fungus option:

gmes_petap.pl --sequence <genome.fasta> --ES --fungus

GeneMark-ET

This algorithm takes “hints” from transcriptome sequencing evidence. After RNAseq reads have been quality-trimmed and aligned to the reference genome, the alignments contain information on intron locations: reads that cross an exon-exon junction are split in the alignment and so 'span' the intron in between. It is these RNAseq-supported introns that GeneMark takes on as external hints for its gene prediction algorithm.
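
To make the idea of reads spanning introns concrete: in SAM format, a spliced alignment records the skipped stretch of reference as an `N` operation in the CIGAR string. The sketch below walks through each CIGAR string and prints the reference coordinates of every `N` gap. It is only an illustration of where the intron information lives, not how bam2hints is implemented; `spliced_introns` is a hypothetical name.

```shell
# Hypothetical helper for illustration only.
# stdin:  SAM records (header lines are skipped)
# stdout: reference name, intron start, intron end (1-based, inclusive)
spliced_introns() {
    awk -F'\t' '
    !/^@/ && $6 != "*" {
        pos = $4        # leftmost mapped position of the read (1-based)
        cigar = $6
        # Consume the CIGAR string one <length><operation> pair at a time
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op  = substr(cigar, RLENGTH, 1)
            # N = skipped region on the reference, i.e. a putative intron
            if (op == "N") printf "%s\t%d\t%d\n", $3, pos, pos + len - 1
            # Only these operations advance the position on the reference
            if (op ~ /[MDN=X]/) pos += len
            cigar = substr(cigar, RLENGTH + 1)
        }
    }'
}
```

Assuming samtools is installed, something like `samtools view rnaseq_vs_genome.sort.bam | spliced_introns | sort | uniq -c` would give a rough count of how many reads support each intron.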

BRAKER2 uses the tool bam2hints to scan such RNAseq alignment files, and then processes the hints before passing them on to GeneMark:

# get hints from rnaseq alignment bam file
bam2hints --intronsonly --minintronlen 20 --in=rnaseq_vs_genome.sort.bam --out=intron_hints.gff

# process hints
cat intron_hints.gff | sort -n -k4,4 | sort -s -n -k5,5 | sort -s -n -k3,3 | sort -s -k1,1 > intron_hints.sort.gff
join_multiple_hints.pl < intron_hints.sort.gff > hintsfile.tmp.gff
filterIntronsFindStrand.pl <genome.fasta> hintsfile.tmp.gff --score > hintsfile.gff

# run GeneMark-ET
gmes_petap.pl --verbose --sequence=<genome.fasta> --ET=hintsfile.gff --et_score 10 --cores=2

NOTE that join_multiple_hints.pl doesn't really do anything if you only provide it with a single hints file.
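
Before starting a long GeneMark-ET run, it can be worth checking how many intron hints would clear the `--et_score` cutoff. The sketch below assumes that hintsfile.gff follows the usual GFF layout with the score in column 6 (which is where the `--score` option of filterIntronsFindStrand.pl is expected to put it) and that a hint counts when its score is at least the cutoff. `count_hints_by_score` is a hypothetical helper, not part of GeneMark or BRAKER2.

```shell
# Hypothetical helper for illustration only.
# Usage: count_hints_by_score <hints.gff> [min_score]
count_hints_by_score() {
    gff="$1"
    min_score="${2:-10}"    # same value as passed to --et_score above
    awk -F'\t' -v min="$min_score" '
        # Skip comment lines and anything without a score column
        !/^#/ && NF >= 6 {
            total++
            if ($6 + 0 >= min) kept++
        }
        END { printf "%d of %d introns have score >= %s\n", kept, total, min }
    ' "$gff"
}
```

For example, `count_hints_by_score hintsfile.gff 10` reports how many introns sit at or above a score of 10; a very small number may suggest lowering the cutoff or adding more RNAseq data.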
