gene_prediction_just_genemark

Differences

This shows you the differences between two versions of the page.

gene_prediction_just_genemark [2023/01/09 16:09] 134.190.232.140
gene_prediction_just_genemark [2026/02/26 11:53] (current) 129.173.242.70
Line 1: Line 1:
 ====== Gene prediction with just GeneMark ======
  
-Joran Martijn (January 2023)
+Created by Joran Martijn in 2023.
  
-**GeneMark** is one of oldest gene prediction tools still in development, with papers describing the first algorithms as early as 1986, [[https://www.sciencedirect.com/science/article/pii/030326479390068N|1993 (1)]], [[https://www.sciencedirect.com/science/article/pii/009784859385004V|1993 (2)]] and [[https://academic.oup.com/nar/article/26/4/1107/2902172?login=true|1998]]. The latest update (as of January 2023, GeneMark-EP+) has been published in [[https://academic.oup.com/nargab/article/2/2/lqaa026/5836691?login=true|2022]].
+Updated by Jason Shao on February 26th, 2026.
 + 
 +**GeneMark** is one of the oldest gene prediction tools still in development, with papers describing the first algorithms as early as [[http://exon.gatech.edu/Genemark/PDF/Statistical_Patterns_in_Primary___Article.pdf|1986]], [[https://www.sciencedirect.com/science/article/pii/030326479390068N|1993 (1)]], [[https://www.sciencedirect.com/science/article/pii/009784859385004V|1993 (2)]] and [[https://academic.oup.com/nar/article/26/4/1107/2902172?login=true|1998]]. The latest update (as of January 2023, GeneMark-EP+) was published in [[https://academic.oup.com/nargab/article/2/2/lqaa026/5836691?login=true|2022]].
 + 
 +GeneMark was originally developed for prokaryotes but has since been extended to work with eukaryotes as well. The current release is the fourth eukaryotic version. The first two versions (Lukashin and Borodovsky; Tarasenko and Borodovsky) were never published and are cited only as unpublished data in the paper describing the third version, which is available [[https://academic.oup.com/nar/article/33/20/6494/1082033?login=true|here]].
  
 GeneMark is maintained and developed by Mark Borodovsky's Bioinformatics Lab at the Georgia Institute of Technology (Georgia Tech) in the US.
Line 10: Line 14:
  
 Unfortunately, the GeneMark tools are not distributed through conda repositories and can only be downloaded from their [[http://topaz.gatech.edu/genemark/license_download.cgi|website]]. In addition, you need to download a license file and renew it every 6 months to keep using GeneMark. This archaic way of doing things is probably a result of GeneMark's age. There are also [[https://github.com/gatech-genemark|GitHub]] pages, but they do not seem to contain much at the moment.
 +
 +<code>
 +# use your browser to download the license key relevant to your system
 +
 +# unpack the key and install it as a hidden file in your home directory,
 +# which is where GeneMark looks for it
 +gunzip gm_key_64.gz
 +mv gm_key_64 ~/.gm_key
 +</code>
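
Because the key has to be renewed every 6 months, a small check before a long run can save a failed job. The following is only a sketch: ''check_gm_key'' is a hypothetical helper, the default location ''~/.gm_key'' and the 180-day limit are assumptions, and it relies on the file's modification time reflecting when the key was downloaded, which may not hold if the timestamp was preserved during unpacking.

```shell
# Hypothetical helper (not part of GeneMark): report whether the key file
# is missing, older than a given number of days, or apparently fine.
check_gm_key() {
    local key="${1:-$HOME/.gm_key}"   # assumed key location
    local max_days="${2:-180}"        # assumed stand-in for "6 months"

    if [ ! -f "$key" ]; then
        echo "missing"
        return 1
    fi

    # find prints the path only if the file was last modified
    # more than max_days days ago
    if [ -n "$(find "$key" -mtime +"$max_days" -print)" ]; then
        echo "stale"
        return 2
    fi

    echo "ok"
}
```

Calling ''check_gm_key'' without arguments inspects ''~/.gm_key''; a result of ''stale'' or ''missing'' suggests downloading a fresh key first.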
  
 The gene prediction algorithms relevant to eukaryotes are collected in the GeneMark-ES suite, where E stands for Eukaryotic and S for Self-training. The suite contains GeneMark.hmm, GeneMark-ES, GeneMark-ET and GeneMark-EP.
Line 29: Line 41:
 ==== GeneMark-ES ====
  
-This is perhaps the most straightforward and pure //ab initio// gene prediction tool. Only the genome FASTA file is provided, and the algorithm will do its best without any external sources of evidence, to predict the gene start and end locations, including possible introns.
+This is perhaps the most straightforward, purely //ab initio// gene prediction tool. You provide only the genome FASTA file, and the algorithm predicts gene start and end locations, including possible introns, without any external evidence or training input (hence "Self-training").
 + 
 +Create a conda environment for GeneMark-ES: 
 + 
 +<code> 
 +conda create -n genemark-es perl perl-mce perl-yaml perl-hash-merge perl-parallel-forkmanager 
 +</code> 
 + 
 +Running GeneMark-ES:
  
 <code>
 +source activate genemark-es
 gmes_petap.pl --sequence <genome.fasta> --ES
 </code>
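
Once a run has finished, a quick way to gauge the result is to count the predicted genes. This sketch assumes the predictions are in a GTF file (GeneMark-ES typically writes ''genemark.gtf'' to the working directory) and that each record carries a standard ''gene_id "..."'' attribute; ''count_genes'' is a hypothetical helper name.

```shell
# Hypothetical helper: count distinct gene_id values in a GTF file.
# Assumes the usual  gene_id "..."  attribute in the last column.
count_genes() {
    grep -o 'gene_id "[^"]*"' "$1" | sort -u | wc -l | tr -d ' '
}
```

For example, ''count_genes genemark.gtf'' prints a single number. A count that is far off from what is expected for related genomes may be a reason to inspect the run log.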
Line 41: Line 62:
 </code>
  
 +==== GeneMark-ET ====
  
 +This algorithm takes "hints" from transcriptome sequencing evidence. After RNA-seq reads have been quality-trimmed and aligned to the reference genome, the alignments contain information on intron locations, because spliced reads span the introns. These RNA-seq-supported introns are passed to GeneMark as external hints for its gene prediction algorithm.
  
 +[[https://github.com/Gaius-Augustus/BRAKER|Braker2]] uses the tool ''bam2hints'' to scan such RNAseq alignment files and processes the hints before passing them on to GeneMark:
 +
 +<code>
 +# get hints from rnaseq alignment bam file
 +bam2hints --intronsonly --minintronlen 20 --in=rnaseq_vs_genome.sort.bam --out=intron_hints.gff
 +
 +# process hints
 +cat intron_hints.gff | sort -n -k4,4 | sort -s -n -k5,5 | sort -s -n -k3,3 | sort -s -k1,1 > intron_hints.sort.gff
 +join_multiple_hints.pl < intron_hints.sort.gff > hintsfile.tmp.gff
 +filterIntronsFindStrand.pl <genome.fasta> hintsfile.tmp.gff --score > hintsfile.gff
 +
 +# run GeneMark-ET
 +gmes_petap.pl --verbose --sequence=<genome.fasta> --ET=hintsfile.gff --et_score 10 --cores=2
 +</code>
 +
 +NOTE that ''join_multiple_hints.pl'' doesn't really do anything if you provide it with only a single hints file.
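
Before starting GeneMark-ET, it can be useful to see how many intron hints would clear the ''--et_score'' threshold. The sketch below rests on two assumptions that are worth confirming against your own file: that ''filterIntronsFindStrand.pl --score'' writes the intron support into the sixth (score) column of ''hintsfile.gff'', and that ''--et_score'' acts as a minimum. ''count_hints_at_least'' is a hypothetical helper name.

```shell
# Hypothetical helper: count GFF records whose score (column 6)
# is at least the given threshold. Comment lines are skipped.
count_hints_at_least() {
    awk -F '\t' -v min="$2" '
        !/^#/ && NF >= 6 && ($6 + 0) >= (min + 0) { n++ }
        END { print n + 0 }
    ' "$1"
}
```

For example, ''count_hints_at_least hintsfile.gff 10'' prints the number of hints with a score of 10 or more. If that number is very small, a lower threshold or more RNA-seq data may be worth considering.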
 +
 +==== Running GeneMark on perun ====
 +
 +As far as I know, there is no working environment on perun dedicated to GeneMark. However, braker2 calls GeneMark, so the braker2 environment has all the dependencies needed to run GeneMark as well.
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +#$ -cwd
 +#$ -m bea
 +#$ -pe threaded 20
 +
 +source activate braker2
 +
 +# add gmes_petap.pl installation location to the $PATH
 +export PATH="/scratch2/software/gmes_linux_64-aug-2020/:$PATH"
 +
 +# input
 +ORIGINAL_GENOME='ergo_cyp_genome.fasta.masked'
 +RNASEQ='rnaseq_vs_masked_ergo_cyp_genome.sort.bam'
 +THREADS=20
 +
 +# the three options below are alternatives: keep the one that matches your data
 +
 +# if you have no transcriptome data and you just want to do ab initio gene prediction
 +gmes_petap.pl --sequence $ORIGINAL_GENOME --ES --cores=$THREADS
 +
 +# if you have a fungal like genome, use genemark-ES with --fungus
 +gmes_petap.pl --sequence $ORIGINAL_GENOME --ES --cores=$THREADS --fungus
 +
 +# if you have transcriptome data, use genemark-ET
 +## get hints from rnaseq alignment bam file
 +bam2hints --intronsonly --minintronlen 20 --in=$RNASEQ --out=intron_hints.gff
 +## process hints
 +cat intron_hints.gff | sort -n -k4,4 | sort -s -n -k5,5 | sort -s -n -k3,3 | sort -s -k1,1 > intron_hints.sort.gff
 +join_multiple_hints.pl < intron_hints.sort.gff > hintsfile.tmp.gff
 +filterIntronsFindStrand.pl $ORIGINAL_GENOME hintsfile.tmp.gff --score > hintsfile.gff
 +## run GeneMark-ET
 +gmes_petap.pl --verbose --sequence=$ORIGINAL_GENOME --ET=hintsfile.gff --et_score 10 --cores=$THREADS
 +
 +</code>
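
A job like the one above can sit in the queue for a while before failing on a missing tool or input file, so a few checks at the top of the script may be worthwhile. This is a minimal sketch: ''require_cmds'' and ''require_files'' are hypothetical helpers, not part of GeneMark or braker2.

```shell
# Hypothetical helper: succeed only if every named command is on the PATH.
require_cmds() {
    local status=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            status=1
        fi
    done
    return "$status"
}

# Hypothetical helper: succeed only if every named file exists and is non-empty.
require_files() {
    local status=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

In the job script, these could be called after the ''export PATH'' line and the input definitions, for example ''require_cmds gmes_petap.pl bam2hints || exit 1'' followed by ''require_files "$ORIGINAL_GENOME" "$RNASEQ" || exit 1''.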