Gene prediction with just GeneMark
Joran Martijn (January 2023)
GeneMark is one of the oldest gene prediction tools still in development, with papers describing the first algorithms published as early as 1986, 1993 (1), 1993 (2) and 1998. The latest update (as of January 2023), GeneMark-EP+, was published in 2022.
GeneMark is maintained and developed by Mark Borodovsky's Bioinformatics Lab at the Georgia Institute of Technology (Georgia Tech) in the US.
The name GeneMark may be derived from the Markov models it uses, but it may or may not also be derived from Dr. Borodovsky's first name.
Unfortunately, the GeneMark tools are not distributed through conda repositories and can only be downloaded from the GeneMark website. In addition, you need to download a license file and renew it every 6 months in order to keep using GeneMark. This archaic way of doing things is probably a result of GeneMark's age. There are also GitHub pages, but they do not seem to contain much at the moment.
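Because the license file has to be renewed every 6 months, a quick check before a long run can save some frustration. The following is a minimal sketch, not part of GeneMark: check_gm_key is a hypothetical helper, and it assumes that the key is stored as ~/.gm_key and that the file's modification time is a rough proxy for the age of the license.

```shell
#!/bin/sh
# Sketch only: check_gm_key is a hypothetical helper, not part of GeneMark.
# Assumption: the license key is stored as ~/.gm_key.
# Assumption: the file's modification time roughly reflects the license age.

check_gm_key() {
    key="${1:-$HOME/.gm_key}"    # optional argument: alternative key location

    if [ ! -f "$key" ]; then
        echo "missing: $key"
        return 1
    fi

    # find prints the path only if the file is older than 180 days (~6 months).
    if [ -n "$(find "$key" -mtime +180)" ]; then
        echo "possibly expired: $key"
        return 2
    fi

    echo "ok: $key"
}
```

Calling check_gm_key without arguments inspects ~/.gm_key. A "possibly expired" message is only a hint to download a fresh license file, since the modification time does not have to match the real expiry date.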
The set of gene prediction algorithms relevant to eukaryotes is collected in the GeneMark-ES suite, where E stands for Eukaryotic and S for Self-training. The suite contains GeneMark.hmm, GeneMark-ES, GeneMark-ET and GeneMark-EP.
The different algorithms are all called with the Perl script gmes_petap.pl. Run it without any additional arguments to get the help page (note that the standard -h or --help does not work here).
GeneMark-ES
This is perhaps the most straightforward and purest ab initio gene prediction tool. You provide only the genome FASTA file, and the algorithm does its best, without any external sources of evidence, to predict gene start and end locations, including possible introns.
gmes_petap.pl --sequence <genome.fasta> --ES
If your genome is fungal or fungal-like, you can also invoke the --fungus option:
gmes_petap.pl --sequence <genome.fasta> --ES --fungus
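The two invocations above can be wrapped in a small helper that keeps each run in its own directory. This is a minimal sketch rather than an official recipe: run_genemark_es and DRY_RUN are hypothetical names, and it assumes that gmes_petap.pl is on your PATH and writes its output into the current working directory. Only the --sequence, --ES and --fungus options shown above are used.

```shell
#!/bin/sh
# Sketch only: run_genemark_es and DRY_RUN are hypothetical, not part of GeneMark.
# Assumption: gmes_petap.pl is on PATH and writes output to the current directory.

# The body is wrapped in ( ... ) so variables and 'cd' stay inside the function.
run_genemark_es() (
    genome="$1"          # path to the genome FASTA file
    fungal="${2:-no}"    # pass "yes" for fungal or fungal-like genomes

    # Absolute path, so the file is still found after changing directory.
    genome_abs="$(cd "$(dirname "$genome")" && pwd)/$(basename "$genome")"
    outdir="$(basename "$genome").genemark_es"

    # Collect the arguments as positional parameters.
    set -- --sequence "$genome_abs" --ES
    if [ "$fungal" = "yes" ]; then
        set -- "$@" --fungus
    fi

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only show the command that would be run.
        echo "gmes_petap.pl $*"
    else
        mkdir -p "$outdir" && cd "$outdir" && gmes_petap.pl "$@"
    fi
)

run_genemark_es genome.fasta        # dry run, plain GeneMark-ES
run_genemark_es genome.fasta yes    # dry run, with --fungus
```

By default the helper only prints the command. Setting DRY_RUN=0 (for example DRY_RUN=0 run_genemark_es genome.fasta yes) creates the output directory and starts the actual prediction there.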
