====== Gene prediction with just GeneMark ======

Created by Joran Martijn in 2023.

Updated by Jason Shao on February 26th, 2026.

**GeneMark** is one of the oldest gene prediction tools still in development, with papers describing the first algorithms published as early as [[http://exon.gatech.edu/Genemark/PDF/Statistical_Patterns_in_Primary___Article.pdf|1986]], [[https://www.sciencedirect.com/science/article/pii/030326479390068N|1993 (1)]], [[https://www.sciencedirect.com/science/article/pii/009784859385004V|1993 (2)]] and [[https://academic.oup.com/nar/article/26/4/1107/2902172?login=true|1998]]. The latest update (as of January 2023), GeneMark-EP+, was published in [[https://academic.oup.com/nargab/article/2/2/lqaa026/5836691?login=true|2020]].

GeneMark was originally developed for prokaryotes but has since been extended to work with eukaryotes as well. We are currently on the 4th eukaryotic version. The first two versions (Lukashin and Borodovsky; Tarasenko and Borodovsky) were never published and are cited only as unpublished data by the paper describing the third version, which is available [[https://academic.oup.com/nar/article/33/20/6494/1082033?login=true|here]].

GeneMark is maintained and developed by Mark Borodovsky's Bioinformatics Lab at the Georgia Institute of Technology (Georgia Tech) in the US.

The name GeneMark may derive from the Markov models it uses, and may or may not also derive from Dr. Borodovsky's first name.

Unfortunately, the GeneMark tools are not distributed through conda repositories and can only be downloaded from their [[http://topaz.gatech.edu/genemark/license_download.cgi|website]]. In addition, you need to download a license file and renew it every 6 months in order to keep using GeneMark. This archaic way of doing things is probably a result of GeneMark's age. There are also [[https://github.com/gatech-genemark|GitHub]] pages, but they do not contain much at the moment.

<code>
# use your browser to download the license key relevant to your system

# unpack and rename the file
gunzip gm_key_64.gz
mv gm_key_64 .gm_key
</code>
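
Because the key has to be renewed every 6 months, it can be handy to check its state before starting a long run. The following is a minimal sketch, not part of GeneMark: ''key_status'' is a hypothetical helper, it assumes the key lives at ''~/.gm_key'', it approximates "6 months" as 180 days, and it uses the file's modification time as a stand-in for the download date.

<code bash>
#!/usr/bin/env bash
# Sketch: report whether a GeneMark key file exists and roughly how old it is.
# key_status is a hypothetical helper, not something shipped with GeneMark.
# Usage: key_status [key_file] [max_age_days]
key_status() {
    local key="${1:-$HOME/.gm_key}"   # assumption: key is stored as ~/.gm_key
    local max_days="${2:-180}"        # assumption: "6 months" is about 180 days

    if [ ! -f "$key" ]; then
        echo "missing"
        return 1
    fi

    # find prints the path only if the file was last modified
    # more than max_days days ago
    if [ -n "$(find "$key" -mtime +"$max_days" -print)" ]; then
        echo "expired"
        return 2
    fi

    echo "ok"
}
</code>

Calling ''key_status'' without arguments checks ''~/.gm_key'' and prints ''ok'', ''expired'' or ''missing''.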

The gene prediction algorithms relevant to eukaryotes are collected in the GeneMark-ES suite, where E stands for Eukaryotic and S for Self-training. The suite contains GeneMark.hmm, GeneMark-ES, GeneMark-ET and GeneMark-EP.

The different algorithms are all called with the Perl script ''gmes_petap.pl''. Run it without any additional arguments to get the help page (note that the standard ''-h'' or ''--help'' does not work here). You may think, like me, that ''gmes_petap.pl'' is quite a peculiar name. You are not wrong, but each of these letters is shorthand for one of the algorithms stashed in this script:

<code>
gm   GeneMark.hmm
e    Eukaryote
s    Self-training
p    Plus
e    Evidence
t    Transcripts
a    ..and..
p    Proteins
</code>


==== GeneMark-ES ====

This is perhaps the most straightforward and pure //ab initio// gene prediction tool. Only the genome FASTA file is provided, and the algorithm will do its best without any external sources of evidence or training input (hence Self-training), to predict the gene start and end locations, including possible introns.

Create a conda environment for GeneMark-ES:

<code>
conda create -n genemark-es perl perl-mce perl-yaml perl-hash-merge perl-parallel-forkmanager
</code>

Running GeneMark-ES:

<code>
source activate genemark-es
gmes_petap.pl --sequence <genome.fasta> --ES
</code>

If your genome is fungal or fungal-like, you can also invoke the ''--fungus'' option:

<code>
gmes_petap.pl --sequence <genome.fasta> --ES --fungus
</code>
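
Once the run has finished, a quick sanity check is to count how many genes were predicted. The sketch below assumes the predictions are written as a GTF file (''genemark.gtf'' in the working directory in the releases I have used; check your own output) and that every record carries a ''gene_id "..."'' attribute. ''count_genes'' is a hypothetical helper, not a GeneMark tool.

<code bash>
#!/usr/bin/env bash
# Sketch: count distinct gene_id values in a GTF file.
# count_genes is a hypothetical helper, not part of GeneMark.
# Usage: count_genes <predictions.gtf>
count_genes() {
    local gtf="$1"
    # Column 9 (tab-separated) holds the attributes; collect every
    # distinct gene_id "..." entry and print how many there are.
    awk -F'\t' '
        !/^#/ && match($9, /gene_id "[^"]+"/) {
            ids[substr($9, RSTART, RLENGTH)] = 1
        }
        END {
            n = 0
            for (id in ids) n++
            print n
        }' "$gtf"
}
</code>

For example, ''count_genes genemark.gtf'' prints a single number. Counting ''gene_id'' values rather than lines avoids counting multi-exon genes more than once.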

==== GeneMark-ET ====

This algorithm takes "hints" from transcriptome sequencing evidence. After RNAseq reads have been quality-trimmed and aligned to the reference genome, the alignments contain information on intron locations: reads that cross an exon-exon junction 'span' the intron in between. It is these introns, supported by RNAseq reads, that GeneMark takes on as external hints for its gene prediction algorithm.
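
To get a feel for how much intron evidence an alignment contains, you can count the reads that span an intron. In SAM/BAM files these reads have an ''N'' (skipped reference region) operation in their CIGAR string. The sketch below is only an illustration: ''count_spliced_reads'' is a hypothetical helper that reads SAM records on standard input, and it assumes a splice-aware aligner was used.

<code bash>
#!/usr/bin/env bash
# Sketch: count alignments whose CIGAR string contains an N operation,
# i.e. reads that skip over a stretch of the reference (an intron).
# count_spliced_reads is a hypothetical helper; it reads SAM text on stdin.
count_spliced_reads() {
    # Column 6 is the CIGAR string; SAM header lines start with '@'.
    awk -F'\t' '!/^@/ && $6 ~ /N/ { n++ } END { print n + 0 }'
}
</code>

With ''samtools'' available, this can be used as ''samtools view rnaseq_vs_genome.sort.bam | count_spliced_reads''.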

[[https://github.com/Gaius-Augustus/BRAKER|Braker2]] uses the tool ''bam2hints'' to scan such RNAseq alignment files and processes the hints before passing them on to GeneMark:

<code>
# get hints from rnaseq alignment bam file
bam2hints --intronsonly --minintronlen 20 --in=rnaseq_vs_genome.sort.bam --out=intron_hints.gff

# process hints
cat intron_hints.gff | sort -n -k4,4 | sort -s -n -k5,5 | sort -s -n -k3,3 | sort -s -k1,1 > intron_hints.sort.gff
join_multiple_hints.pl < intron_hints.sort.gff > hintsfile.tmp.gff
filterIntronsFindStrand.pl <genome.fasta> hintsfile.tmp.gff --score > hintsfile.gff

# run GeneMark-ET
gmes_petap.pl --verbose --sequence=<genome.fasta> --ET=hintsfile.gff --et_score 10 --cores=2
</code>

NOTE that ''join_multiple_hints.pl'' doesn't really do anything if you only provide it with a single hints file.
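
Before running GeneMark-ET it can be useful to see how many intron hints are well supported. The sketch below counts the intron records in a hints GFF whose score (column 6) reaches a threshold. It rests on two assumptions that you should verify for your own version: that ''filterIntronsFindStrand.pl --score'' stores the read support in the score column, and that ''--et_score 10'' is applied as a minimum on that column. ''count_hints_by_score'' is a hypothetical helper.

<code bash>
#!/usr/bin/env bash
# Sketch: print "<passing> <total>" for the intron hints in a GFF file,
# where "passing" means score (column 6) >= min_score.
# count_hints_by_score is a hypothetical helper, not part of BRAKER or GeneMark.
# Usage: count_hints_by_score <hints.gff> [min_score]
count_hints_by_score() {
    local gff="$1"
    local min_score="${2:-10}"   # assumption: mirrors --et_score 10
    awk -F'\t' -v min="$min_score" '
        !/^#/ && tolower($3) == "intron" {
            total++
            if ($6 + 0 >= min + 0) pass++
        }
        END { print pass + 0, total + 0 }' "$gff"
}
</code>

For example, ''count_hints_by_score hintsfile.gff 10'' prints two numbers: the hints at or above the threshold, and the total number of intron hints.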

==== Running GeneMark on perun ====

As far as I know, there is no working environment on perun dedicated to GeneMark. However, braker2 calls GeneMark, so the braker2 environment has all the dependencies needed to run GeneMark as well:

<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -pe threaded 20

source activate braker2

# add gmes_petap.pl installation location to the $PATH
export PATH="/scratch2/software/gmes_linux_64-aug-2020/:$PATH"

# input
ORIGINAL_GENOME='ergo_cyp_genome.fasta.masked'
RNASEQ='rnaseq_vs_masked_ergo_cyp_genome.sort.bam'
THREADS=20

# if you have no transcriptome data and you just want to do ab initio gene prediction
gmes_petap.pl --sequence $ORIGINAL_GENOME --ES --cores=$THREADS

# if you have a fungal like genome, use genemark-ES with --fungus
gmes_petap.pl --sequence $ORIGINAL_GENOME --ES --cores=$THREADS --fungus

# if you have transcriptome data, use genemark-ET
## get hints from rnaseq alignment bam file
bam2hints --intronsonly --minintronlen 20 --in=$RNASEQ --out=intron_hints.gff
## process hints
cat intron_hints.gff | sort -n -k4,4 | sort -s -n -k5,5 | sort -s -n -k3,3 | sort -s -k1,1 > intron_hints.sort.gff
join_multiple_hints.pl < intron_hints.sort.gff > hintsfile.tmp.gff
filterIntronsFindStrand.pl $ORIGINAL_GENOME hintsfile.tmp.gff --score > hintsfile.gff
## run GeneMark-ET
gmes_petap.pl --verbose --sequence=$ORIGINAL_GENOME --ET=hintsfile.gff --et_score 10 --cores=2

</code>
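
The job script above lists the three variants one after another; in practice you would normally keep only the one that matches your data. One way to make that choice explicit is sketched below. ''gmes_command'' is a hypothetical helper that only prints the command line for a given mode, using the same options as shown on this page; the mode names ''es'', ''fungus'' and ''et'' are my own.

<code bash>
#!/usr/bin/env bash
# Sketch: print the gmes_petap.pl command line for a single mode,
# so that a job script runs one prediction instead of all three.
# gmes_command is a hypothetical helper; the mode names are made up here.
# Usage: gmes_command <es|fungus|et> <genome.fasta> <threads> [hints.gff]
gmes_command() {
    local mode="$1" genome="$2" threads="$3" hints="${4:-hintsfile.gff}"
    case "$mode" in
        es)
            echo "gmes_petap.pl --sequence $genome --ES --cores=$threads" ;;
        fungus)
            echo "gmes_petap.pl --sequence $genome --ES --fungus --cores=$threads" ;;
        et)
            echo "gmes_petap.pl --sequence=$genome --ET=$hints --et_score 10 --cores=$threads" ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}
</code>

For example, ''gmes_command et "$ORIGINAL_GENOME" "$THREADS"'' prints the GeneMark-ET command line so you can inspect it first. Once it looks right, it can be run with ''eval "$(gmes_command et "$ORIGINAL_GENOME" "$THREADS")"'', provided the file names contain no spaces.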