This is an old revision of the document!
Table of Contents
Gene prediction with the Braker2 pipeline
GP using machine learning and extrinsic hints by DE Salas-Leiva (last updated Oct-21-2020)
Updated by Joran Martijn & Jason Shao in December 2022
Multiple steps need to be carried out for Gene Prediction:
Think about whether you should mask your genome. Masking is used to hide repeat regions, things like transposons. Protist genomes tend to be on the small side and generally don't have as many repeat regions as plant and metazoan genomes. As well, very high and very low GC content can be erroneously masked out. So, before you mask take some time to think about your genome and its characteristics.
Example scripts for submitting these steps to Perun are available on the Roger Lab and ICG Github pages.
GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow make sure you have an up to date licence for GeneMark-ET hidden in your home directory: id your key is older than 3 months or you are not sure about it, you are better off by downloading a new licence as follows:
go to http://topaz.gatech.edu/GeneMark/license_download.cgi and download the 64 bits version: GeneMark-ES / ET (version) and move it to perun:
then do the following commands in your home directory in perun:
gunzip gm_key_64.gz
and finally, change the file name and hide it:
mv gm_key_64 .gm_key
Repeat masking
From the BRAKER1 paper:
“Repetitive sequences create challenges for automatic gene finders both at parameter estimation step and gene prediction step. The size and quality of the training set generated by GeneMark-ET for AUGUSTUS (multi-exon genes with so called anchored introns, the introns predicted ab initio and also supported by RNA-Seq read mapping) is not significantly affected by TEs masking since TEs have not anchored introns. However, at the prediction step TEs can corrupt gene prediction. For this reason, soft masking of genomic sequence is recommended before execution of BRAKER1.”
Some repetitive elements in your genome may per-chance look like ORFs or even protein coding genes. The main purpose of masking these repeats is to prevent your gene predictor from even looking at these regions, so they will not predict any false positive genes there.
Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly.
NOTE: The -xsmall option of RepeatMasker will ensure that your output is soft-masked. This means that the repetitive regions will be presented in lower case in the output FASTA file. This is in contrast with hard-masking, which will replace repetitive sequences with Ns.
NOTE: The masked assembly is necessary for Braker, not for Hisat2
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20
cd $PWD
original_genome=CARP_genome_july16.fasta
root=CARP_genome_july16
nthreads=20
RMdir=RepeatMasked_genome/
source activate repeatmasker+modeller-no-update
export PATH=/misc/scratch2/software/anaconda/envs/repeatmasker+modeller-no-update/custom/RepeatModeler-open-1.0.11:$PATH
mkdir RepeatMasked_genome
echo 'Modeling and Masking genome'
BuildDatabase -name $root.index $original_genome
sleep 30s
RepeatModeler -pa $nthreads -database $root.index > $root.run.out
RepeatMasker \
-pa $nthreads \
-xsmall \
-s \
-nolow \
-dir $RMdir \
-a \
-inv \
-lib $root.index-families.fa \
$original_genome
source deactivate
RNAseq mapping
RNAseq data is direct evidence of which areas of your genome are expressed. Mapping your RNAseq data to your genome with a splice-aware mapper such as Hisat2 will yield information on the starts and stops of protein coding genes, as well as, most importantly perhaps, the start and stop coordinates of introns.
On the masked assembly map the RNAseq using Hisat2, sort the output and create a depth file:
NOTE: Since our masked assembly is soft-masked, RNAseq reads originating from repetitive regions will still map here. Only if repetitive regions are hard-masked, will Hisat2 ignore them.
NOTE: Only paired reads are mapped here! If you also try to map unpaired reads, Trinity will complain.
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 10
cd $PWD
source activate hisat2
hisat2-build -f your_masked_genome your_genome_index
hisat2 \
-q \
-p 10 \
--phred33 \
--max-intronlen 1000 \
-k 2 \
-x yourgenome_index \
-1 READS_R1 \
-2 READS_R2 \
--rna-strandness RF \
-S your_masked_genome.sam
source deactivate
samtools view -bS -o your_genome.sambam your_masked_genome.sam
samtools sort your_masked_genome.sambam -o yourmasked_genome.sambamsorted.bam
samtools index your_masked_genome.sambamsorted.bam
NOTE: The –rna-strandness option here can only be used if your RNAseq experiment was done with stranded chemistry, i.e. that we know for each read whether it came from the positive or negative strand of the genome. This option is critical if we later want to split the BAM file into two BAM files, one for positive strand read mappings, and one for negative strand read mappings. The strand information in the BAM file generated by Hisat2 is captured by the XS tag. XS:A:+ for positive strand mappings, and XS:A:- for negative strand mappings. If you don't invoke –rna-strandness but still perform the BAM split, for some reason only read mappings that were spliced during alignment are retained.
Genome-guided transcriptome assembly
Do this step only if you are going to use PASA for prediction validation, otherwise you must skip it: Assembly the RNAseq in genome guided mode with Trinity. This shell is for Strand Specific RNAseq
#!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 10 cd $PWD source activate trinity-2.11-with-workaround Trinity --CPU 10 --max_memory 100G --genome_guided_bam yourgenome.fasta.sambamsorted.bam --genome_guided_max_intron 1000 --SS_lib_type RF conda deactivate
To avoid multiple isoforms that could affect PASA performance you could create a transcript file containing only supertranscripts:
supertranscripts=/misc/scratch2/software/trinityrnaseq-Trinity-v2.6.6/Analysis/SuperTranscripts source activate trinity $supertranscripts/Trinity_gene_splice_modeler.py --trinity_fasta Trinity-GG.fasta conda deactivate
Braker2
Braker is a fully automated pipeline that makes use of the ab initio gene predictor GeneMark, RNAseq data mapping, and using the data of those two to train the machine learning algorithm of AUGUSTUS, which then promptly does a final round of gene prediction. Or something like that..
Predict genes using Genemark-ET and Augustus through braker2:
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20
cd $PWD
source activate braker-preq
export VALIDATOR=/opt/perun/PASApipeline-2.0.2/misc_utilities/
export AUGUSTUS_BIN_PATH=/scratch2/software/braker/Augustus/bin/
export AUGUSTUS_CONFIG_PATH=/home/dsalas/Augustus-master/config/
export AUGUSTUS_SCRIPTS_PATH=/scratch2/software/braker/Augustus/scripts/
export PATH=/scratch2/software/braker/Augustus/bin:/scratch2/software/braker/Augustus/scripts:/scratch2/software/braker/BRAKER/scripts:$PATH
export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/
braker.pl \
--species=your_species \
--bam=yourmasked_genome.sambamsorted.bam \
--genome=$original_genome \
--softmasking --cores=20
Predicting gene models with PASA
PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome.
You need to specify your pasa config file. Below an example:
## templated variables to be replaced exist as <__var_name__> # database settings ## pasa will create an sqlite database in the location desired below DATABASE=/scratch3/jmartijn/ergo-genome/results/24_pasa-2.5.2/ergobibamus.sqlite ####################################################### # Parameters to specify to specific scripts in pipeline # create a key = "script_name" + ":" + "parameter" # assign a value as done above. #script validate_alignments_in_db.dbi validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=75 validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95 validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=0 #script subcluster_builder.dbi subcluster_builder.dbi:-m=50
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
# input
CONFIG='pasa.config'
GENOME='ergo_cyp_genome.fasta.masked'
TRANSCRIPTOME='Trinity-GG.fasta'
THREADS=20
source activate pasa-2.5.2
# run pasa
Launch_PASA_pipeline.pl \
--create --run \
-c $CONFIG \
-g $GENOME \
-t $TRANSCRIPTOME \
--transcribed_is_aligned_orient \
--ALIGNERS blat,gmap,minimap2 \
--CPU $THREADS
conda deactivate
Combining gene models with EVidenceModeler (EVM) (work in progress)
NOTE: EVM requires genome sequences in Fasta format and all gene structures and alignment evidences described in GFF3 format.==
Validation
Check the integrity of the gene models you wish to combine via EVM's native validator.
#!/bin/bash #$ -S /bin/bash #$ -cwd #$ -pe threaded 20 #$ -q 768G-batch # Validation script, if there are errors you'll get error messages, otherwise no output. source activate evidencemodeler # input BRAKER=braker2_augustus.hints.gff3 PASA=ergobibamus.sqlite.pasa_assemblies.gff3 $EVM_HOME/EvmUtils/gff3_gene_prediction_file_validator.pl $BRAKER $EVM_HOME/EvmUtils/gff3_gene_prediction_file_validator.pl $PASA conda deactivate
Partition
EVM requires partitioning of the genome assembly for parallelization, which greatly reduces runtime. This step would generate directories containing fragments from each gene models for each contig, as well as a comprehensive list of these fragments and their corresponding system locations.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 768G-batch
# The genome sequences and gff3 files are partitioned based on individual contigs.
# Large contigs are segmented into smaller overlapping chunks.
# This would allow you to process the chunks in parallel.
# input
GENOME=masked_genome.fasta
BRAKER=braker2_augustus.hints.gff3
PASA=ergobibamus.sqlite.pasa_assemblies.gff3
source activate evidencemodeler
$EVM_HOME/EvmUtils/partition_EVM_inputs.pl \
--genome $GENOME \
--gene_predictions $BRAKER \
--transcript_alignments $PASA \
--segmentSize 1000000 \
--overlapSize 200000 \
--partition_listing partitions_list.out \
conda deactivate
Assigning Weights
Since gene models can contain conflicting predictions, a tab-delimited weights file needs to be created to instruct EVM of which prediction to keep.
The weights file has three columns including class, type and weight. The string for class is limited to the options shown below, but other columns can have arbitray values of the same type. Weight is in ascending order. Below is an example:
ABINITIO_PREDICTION augustus 1 ABINITIO_PREDICTION twinscan 1 ABINITIO_PREDICTION glimmerHMM 1 PROTEIN spliced_protein_alignments 1 PROTEIN genewise_protein_alignments 2 TRANSCRIPT spliced_transcript_alignments 1 TRANSCRIPT PASA_transcript_assemblies 10 OTHER_PREDICTION PASA_transdecoder 5
Generating Commands
With all the requiste input files ready, a series of commands need to be generated for running parallelly.
The file <commands.list> is generated with the below script:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 768G-batch
# Be sure to specify the weights file in full path, otherwise it complains
# The verbose option -S does not get recognized :/
# input
GENOME=masked_genome.fasta
BRAKER=braker2_augustus.hints.gff3
PASA=ergobibamus.sqlite.pasa_assemblies.gff3
WEIGHT=/misc/scratch3/jasons/07_annotation/04_evidencemodeler/weights.txt
PART=partitions_list.out
source activate evidencemodeler
$EVM_HOME/EvmUtils/write_EVM_commands.pl \
--genome $GENOME \
--gene_predictions $BRAKER \
--transcript_alignments $PASA \
--weights $WEIGHT \
--output_file_name evm.out \
--partitions $PART \
> commands.list
conda deactivate
Running Commands
Run the series of commands in parallel with the below script:
#!/bin/bash #$ -S /bin/bash #$ -cwd #$ -pe threaded 40 #$ -q 768G-batch # This script executes the commands.list parallelly source activate evidencemodeler parallel -j 40 < commands.list >& evm.log conda deactivate
If the <parallel> command does not work properly, you can invoke the script by its location relative to the $PATH:
$EVM_HOME/EvmUtils/execute_EVM_commands.pl
Combining Fragments
Recombine the fragments for each contig:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 144G-batch
# input
PART=partitions_list.out
PREV_OUT=evm.out
GENOME=masked_genome.fasta
source activate evidencemodeler
$EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl \
--partitions $PART \
--output_file_name $PREV_OUT
conda deactivate
Converting To GFF3
Convert contigs to gff3:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 144G-batch
# input
PART=partitions_list.out
PREV_OUT=evm.out
GENOME=masked_genome.fasta
source activate evidencemodeler
$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl \
--partitions $PART \
--output $PREV_OUT \
--genome $GENOME
conda deactivate
Combining Contigs
Combine gff3 file from each contig to one file:
cat */evm.out.gff3 > EVM.all.gff3
