gene_prediction_just_genemark
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| gene_prediction_just_genemark [2023/01/09 15:29] – created 134.190.232.140 | gene_prediction_just_genemark [2026/02/26 11:53] (current) – 129.173.242.70 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Gene prediction with just GeneMark ====== | ====== Gene prediction with just GeneMark ====== | ||
| - | Joran Martijn | + | Created by Joran Martijn |
| - | GeneMark is one of oldest gene prediction tools still in development, with papers describing the first algorithms as early as 1986, [[https:// | + | Updated by Jason Shao on February 26th, 2026. |
| - | The modern implementation | + | **GeneMark** |
| + | |||
| + | GeneMark was originally developed for prokaryotes but has since been extended to work with eukaryotes as well. We are currently | ||
| + | |||
| + | GeneMark is maintained and developed by Mark Borodovsky' | ||
| + | |||
| + | The name GeneMark may be derived from the Markov | ||
| + | |||
| + | Unfortunately, the GeneMark tools are not distributed in conda repositories, | ||
| + | |||
| + | < | ||
| + | # use your browser to download the license key relevant to your system | ||
| + | |||
| + | # unpack and rename the file | ||
| + | gunzip gm_key_64.gz | ||
| + | mv gm_key_64 .gm_key | ||
| + | </ | ||
| + | |||
| + | The gene prediction algorithms relevant to eukaryotes are collected in the GeneMark-ES suite, where E stands for Eukaryotic and S for Self-training. The suite contains GeneMark.hmm, | ||
| + | |||
| + | The different algorithms are all called with the Perl script '' | ||
| + | |||
| + | < | ||
| + | gm | ||
| + | e Eukaryote | ||
| + | s Self-training | ||
| + | p Plus | ||
| + | e Evidence | ||
| + | t Transcripts | ||
| + | a ..and.. | ||
| + | p Proteins | ||
| + | </ | ||
| + | |||
| + | |||
| + | ==== GeneMark-ES ==== | ||
| + | |||
| + | This is perhaps the most straightforward, purely //ab initio// gene prediction tool. Only the genome FASTA file is provided; the algorithm predicts gene start and end locations, including possible introns, without any external evidence or training input (hence Self-training). | ||
| + | |||
| + | Create a conda environment for GeneMark-ES: | ||
| + | |||
| + | < | ||
| + | conda create -n genemark-es perl perl-mce perl-yaml perl-hash-merge perl-parallel-forkmanager | ||
| + | </ | ||
| + | |||
| + | Running GeneMark-ES: | ||
| + | |||
| + | < | ||
| + | source activate genemark-es | ||
| + | gmes_petap.pl --sequence < | ||
| + | </ | ||
| + | |||
| + | If your genome is fungal or fungal-like, | ||
| + | |||
| + | < | ||
| + | gmes_petap.pl --sequence < | ||
| + | </ | ||
| + | |||
| + | ==== GeneMark-ET ==== | ||
| + | |||
| + | This algorithm takes " | ||
| + | |||
| + | [[https:// | ||
| + | |||
| + | < | ||
| + | # get hints from rnaseq alignment bam file | ||
| + | bam2hints --intronsonly --minintronlen 20 --in=rnaseq_vs_genome.sort.bam --out=intron_hints.gff | ||
| + | |||
| + | # process hints | ||
| + | cat intron_hints.gff | sort -n -k4,4 | sort -s -n -k5,5 | sort -s -n -k3,3 | sort -s -k1,1 > intron_hints.sort.gff | ||
| + | join_multiple_hints.pl < intron_hints.sort.gff > hintsfile.tmp.gff | ||
| + | filterIntronsFindStrand.pl < | ||
| + | |||
| + | # run GeneMark-ET | ||
| + | gmes_petap.pl --verbose --sequence=< | ||
| + | </ | ||
| + | |||
| + | NOTE that '' | ||
| + | |||
| + | ==== Running GeneMark on perun ==== | ||
| + | |||
| + | As far as I know, there is no working environment on perun dedicated to GeneMark. However, braker2 calls GeneMark, so the braker2 environment has all the dependencies needed to run GeneMark as well. | ||
| + | |||
| + | < | ||
| + | # | ||
| + | #$ -S /bin/bash | ||
| + | #$ -cwd | ||
| + | #$ -m bea | ||
| + | #$ -pe threaded 20 | ||
| + | |||
| + | source activate braker2 | ||
| + | |||
| + | # add gmes_petap.pl installation location to the $PATH | ||
| + | export PATH="/ | ||
| + | |||
| + | # input | ||
| + | ORIGINAL_GENOME=' | ||
| + | RNASEQ=' | ||
| + | THREADS=20 | ||
| + | |||
| + | # if you have no transcriptome data and you just want to do ab initio gene prediction | ||
| + | gmes_petap.pl --sequence $ORIGINAL_GENOME --ES --cores=$THREADS | ||
| + | |||
| + | # if you have a fungal-like genome, use GeneMark-ES with --fungus | ||
| + | gmes_petap.pl --sequence $ORIGINAL_GENOME --ES --cores=$THREADS --fungus | ||
| + | |||
| + | # if you have transcriptome data, use GeneMark-ET | ||
| + | ## get hints from rnaseq alignment bam file | ||
| + | bam2hints --intronsonly --minintronlen 20 --in=$RNASEQ --out=intron_hints.gff | ||
| + | ## process hints | ||
| + | cat intron_hints.gff | sort -n -k4,4 | sort -s -n -k5,5 | sort -s -n -k3,3 | sort -s -k1,1 > intron_hints.sort.gff | ||
| + | join_multiple_hints.pl < intron_hints.sort.gff > hintsfile.tmp.gff | ||
| + | filterIntronsFindStrand.pl < | ||
| + | ## run GeneMark-ET | ||
| + | gmes_petap.pl --verbose --sequence=$ORIGINAL_GENOME --ET=hintsfile.gff --et_score 10 --cores=2 | ||
| + | |||
| + | </ | ||
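
The perun job script above lists the ES, ES with ''--fungus'', and ET invocations one after another, although in practice you would normally run only one of them. A minimal sketch of how the choice could be made explicit is given below. Note that ''build_gmes_cmd'' and its mode names (''es'', ''es-fungus'', ''et'') are hypothetical helpers invented for this illustration and are not part of GeneMark. The function only prints a command line, using just the ''gmes_petap.pl'' options already shown on this page, so it can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of GeneMark): print the gmes_petap.pl call
# for exactly one mode, so a job script does not run all modes in sequence.
# Only options that already appear on this page are used.
#
# usage: build_gmes_cmd <es|es-fungus|et> <genome.fasta> <threads> [hints.gff]
build_gmes_cmd() {
    local mode="$1" genome="$2" threads="$3" hints="${4:-}"
    case "$mode" in
        es)
            # pure ab initio, self-training
            printf 'gmes_petap.pl --sequence %s --ES --cores=%s\n' \
                "$genome" "$threads"
            ;;
        es-fungus)
            # ab initio with the fungal model
            printf 'gmes_petap.pl --sequence %s --ES --cores=%s --fungus\n' \
                "$genome" "$threads"
            ;;
        et)
            # intron hints from RNA-seq are required for GeneMark-ET
            if [ -z "$hints" ]; then
                echo "mode 'et' needs a hints file as fourth argument" >&2
                return 1
            fi
            printf 'gmes_petap.pl --verbose --sequence=%s --ET=%s --et_score 10 --cores=%s\n' \
                "$genome" "$hints" "$threads"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}
```

For example, ''build_gmes_cmd et genome.fasta 20 hintsfile.gff'' prints the GeneMark-ET command for the processed hints file; the file names here are placeholders. Printing rather than executing keeps the sketch safe to try on a machine where GeneMark is not installed.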
