This is an old revision of the document!

Gene prediction with the Braker2 pipeline

GP using machine learning and extrinsic hints by DE Salas-Leiva (last updated Oct-21-2020)
Updated by Joran Martijn & Jason Shao in December 2022

Multiple steps need to be carried out for Gene Prediction:

Think about whether you should mask your genome. Masking is used to hide repeat regions, things like transposons. Protist genomes tend to be on the small side and generally don't have as many repeat regions as plant and metazoan genomes. As well, very high and very low GC content can be erroneously masked out. So, before you mask take some time to think about your genome and its characteristics.

Example scripts for submitting these steps to Perun are available on the Roger Lab and ICG Github pages.

GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow make sure you have an up to date licence for GeneMark-ET hidden in your home directory: id your key is older than 3 months or you are not sure about it, you are better off by downloading a new licence as follows:

 go to http://topaz.gatech.edu/GeneMark/license_download.cgi and download the 64 bits version:  GeneMark-ES / ET 
 (version) and move it to perun:

then do the following commands in your home directory in perun:

 gunzip gm_key_64.gz

and finally, change the file name and hide it:

 mv gm_key_64 .gm_key

Repeat masking

From the BRAKER1 paper:

“Repetitive sequences create challenges for automatic gene finders both at parameter estimation step and gene prediction step. The size and quality of the training set generated by GeneMark-ET for AUGUSTUS (multi-exon genes with so called anchored introns, the introns predicted ab initio and also supported by RNA-Seq read mapping) is not significantly affected by TEs masking since TEs have not anchored introns. However, at the prediction step TEs can corrupt gene prediction. For this reason, soft masking of genomic sequence is recommended before execution of BRAKER1.”

Some repetitive elements in your genome may per-chance look like ORFs or even protein coding genes. The main purpose of masking these repeats is to prevent your gene predictor from even looking at these regions, so they will not predict any false positive genes there.

Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly.

NOTE: The -xsmall option of RepeatMasker will ensure that your output is soft-masked. This means that the repetitive regions will be presented in lower case in the output FASTA file. This is in contrast with hard-masking, which will replace repetitive sequences with Ns.

NOTE: The masked assembly is necessary for Braker, not for Hisat2

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20

cd $PWD

original_genome=CARP_genome_july16.fasta
root=CARP_genome_july16
nthreads=20
RMdir=RepeatMasked_genome/

source activate repeatmasker+modeller-no-update

export PATH=/misc/scratch2/software/anaconda/envs/repeatmasker+modeller-no-update/custom/RepeatModeler-open-1.0.11:$PATH

mkdir RepeatMasked_genome

echo 'Modeling and Masking genome'
BuildDatabase -name $root.index $original_genome
sleep 30s

RepeatModeler -pa $nthreads -database $root.index > $root.run.out 
RepeatMasker \
    -pa $nthreads \
    -xsmall \
    -s \
    -nolow \
    -dir $RMdir \
    -a \
    -inv \
    -lib $root.index-families.fa \
     $original_genome

source deactivate

RNAseq mapping

RNAseq data is direct evidence of which areas of your genome are expressed. Mapping your RNAseq data to your genome with a splice-aware mapper such as Hisat2 will yield information on the starts and stops of protein coding genes, as well as, most importantly perhaps, the start and stop coordinates of introns.

On the masked assembly map the RNAseq using Hisat2, sort the output and create a depth file:

NOTE: Since our masked assembly is soft-masked, RNAseq reads originating from repetitive regions will still map here. Only if repetitive regions are hard-masked, will Hisat2 ignore them.

NOTE: Only paired reads are mapped here! If you also try to map unpaired reads, Trinity will complain.

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 10

cd $PWD
source activate hisat2

hisat2-build -f your_masked_genome your_genome_index
hisat2 \
    -q \
    -p 10 \
    --phred33 \
    --max-intronlen 1000 \
    -k 2 \
    -x yourgenome_index \
    -1 READS_R1 \
    -2 READS_R2 \
    --rna-strandness RF \
    -S your_masked_genome.sam

source deactivate
 
samtools view -bS -o your_genome.sambam your_masked_genome.sam
samtools sort your_masked_genome.sambam -o yourmasked_genome.sambamsorted.bam
samtools index your_masked_genome.sambamsorted.bam

NOTE: The –rna-strandness option here can only be used if your RNAseq experiment was done with stranded chemistry, i.e. that we know for each read whether it came from the positive or negative strand of the genome. This option is critical if we later want to split the BAM file into two BAM files, one for positive strand read mappings, and one for negative strand read mappings. The strand information in the BAM file generated by Hisat2 is captured by the XS tag. XS:A:+ for positive strand mappings, and XS:A:- for negative strand mappings. If you don't invoke –rna-strandness but still perform the BAM split, for some reason only read mappings that were spliced during alignment are retained.

Genome-guided transcriptome assembly

Do this step only if you are going to use PASA for prediction validation, otherwise you must skip it: Assembly the RNAseq in genome guided mode with Trinity. This shell is for Strand Specific RNAseq

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 10
cd $PWD
source activate trinity-2.11-with-workaround
Trinity --CPU 10 --max_memory 100G --genome_guided_bam yourgenome.fasta.sambamsorted.bam --genome_guided_max_intron 1000 --SS_lib_type RF
conda deactivate

To avoid multiple isoforms that could affect PASA performance you could create a transcript file containing only supertranscripts:

supertranscripts=/misc/scratch2/software/trinityrnaseq-Trinity-v2.6.6/Analysis/SuperTranscripts
source activate trinity
$supertranscripts/Trinity_gene_splice_modeler.py --trinity_fasta Trinity-GG.fasta
conda deactivate

Braker2

Braker is a fully automated pipeline that makes use of the ab initio gene predictor GeneMark, RNAseq data mapping, and using the data of those two to train the machine learning algorithm of AUGUSTUS, which then promptly does a final round of gene prediction. Or something like that..

Predict genes using Genemark-ET and Augustus through braker2:

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20

cd $PWD

source activate braker-preq

export VALIDATOR=/opt/perun/PASApipeline-2.0.2/misc_utilities/
export AUGUSTUS_BIN_PATH=/scratch2/software/braker/Augustus/bin/
export AUGUSTUS_CONFIG_PATH=/home/dsalas/Augustus-master/config/
export AUGUSTUS_SCRIPTS_PATH=/scratch2/software/braker/Augustus/scripts/
export PATH=/scratch2/software/braker/Augustus/bin:/scratch2/software/braker/Augustus/scripts:/scratch2/software/braker/BRAKER/scripts:$PATH
export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/

braker.pl \
    --species=your_species \
    --bam=yourmasked_genome.sambamsorted.bam \
    --genome=$original_genome \
    --softmasking --cores=20

Predicting gene models with PASA

PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome.

You need to specify your pasa config file. Below an example:

## templated variables to be replaced exist as <__var_name__>

# database settings
## pasa will create an sqlite database in the location desired below
DATABASE=/scratch3/jmartijn/ergo-genome/results/24_pasa-2.5.2/ergobibamus.sqlite

#######################################################
# Parameters to specify to specific scripts in pipeline
# create a key = "script_name" + ":" + "parameter"
# assign a value as done above.

#script validate_alignments_in_db.dbi
validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=75
validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95
validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=0

#script subcluster_builder.dbi
subcluster_builder.dbi:-m=50

#!/bin/bash                                                                                                                         
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20

# input
CONFIG='pasa.config'
GENOME='ergo_cyp_genome.fasta.masked'
TRANSCRIPTOME='Trinity-GG.fasta'
THREADS=20

source activate pasa-2.5.2

# run pasa
Launch_PASA_pipeline.pl \
        --create --run \
        -c $CONFIG \
        -g $GENOME \
        -t $TRANSCRIPTOME \
        --transcribed_is_aligned_orient \
        --ALIGNERS blat,gmap,minimap2 \
        --CPU $THREADS

conda deactivate

Combining gene models with EVidenceModeler (EVM) (work in progress)

NOTE: EVM requires genome sequences in Fasta format and all gene structures and alignment evidences described in GFF3 format.==

Validation

Check the integrity of the gene models you wish to combine via EVM's native validator.

#!/bin/bash                                                                                                                                                                                         
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 768G-batch

# Validation script, if there are errors you'll get error messages, otherwise no output.

source activate evidencemodeler

# input
BRAKER=braker2_augustus.hints.gff3 
PASA=ergobibamus.sqlite.pasa_assemblies.gff3

$EVM_HOME/EvmUtils/gff3_gene_prediction_file_validator.pl $BRAKER    

$EVM_HOME/EvmUtils/gff3_gene_prediction_file_validator.pl $PASA

conda deactivate

Partition

EVM requires partitioning of the genome assembly for parallelization, which greatly reduces runtime. This step would generate directories containing fragments from each gene models for each contig, as well as a comprehensive list of these fragments and their corresponding system locations.

#!/bin/bash                                                                                                                                                                                         
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 768G-batch

# The genome sequences and gff3 files are partitioned based on individual contigs.
# Large contigs are segmented into smaller overlapping chunks.
# This would allow you to process the chunks in parallel.

# input
GENOME=masked_genome.fasta 
BRAKER=braker2_augustus.hints.gff3 
PASA=ergobibamus.sqlite.pasa_assemblies.gff3

source activate evidencemodeler

$EVM_HOME/EvmUtils/partition_EVM_inputs.pl \
        --genome $GENOME \
        --gene_predictions $BRAKER \
        --transcript_alignments $PASA \
        --segmentSize 1000000 \
        --overlapSize 200000 \
        --partition_listing partitions_list.out \

conda deactivate

Assigning Weights

Since gene models can contain conflicting predictions, a tab-delimited weights file needs to be created to instruct EVM of which prediction to keep.

The weights file has three columns including class, type and weight. The string for class is limited to the options shown below, but other columns can have arbitray values of the same type. Weight is in ascending order. Below is an example:

   ABINITIO_PREDICTION      augustus                        1
   ABINITIO_PREDICTION      twinscan                        1
   ABINITIO_PREDICTION      glimmerHMM                      1
   PROTEIN                  spliced_protein_alignments      1
   PROTEIN                  genewise_protein_alignments     2
   TRANSCRIPT               spliced_transcript_alignments   1
   TRANSCRIPT               PASA_transcript_assemblies      10
   OTHER_PREDICTION         PASA_transdecoder               5

Generating Commands

With all the requiste input files ready, a series of commands need to be generated for running parallelly.

The file <commands.list> is generated with the below script:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 768G-batch

# Be sure to specify the weights file in full path, otherwise it complains
# The verbose option -S does not get recognized :/

# input
GENOME=masked_genome.fasta 
BRAKER=braker2_augustus.hints.gff3 
PASA=ergobibamus.sqlite.pasa_assemblies.gff3
WEIGHT=/misc/scratch3/jasons/07_annotation/04_evidencemodeler/weights.txt
PART=partitions_list.out

source activate evidencemodeler

$EVM_HOME/EvmUtils/write_EVM_commands.pl \
        --genome $GENOME \
        --gene_predictions $BRAKER \
        --transcript_alignments $PASA \
        --weights $WEIGHT \
        --output_file_name evm.out \
        --partitions $PART \
        > commands.list

conda deactivate

Running Commands

Run the series of commands in parallel with the below script:

#!/bin/bash                                                                                                             
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 40 
#$ -q 768G-batch

# This script executes the commands.list parallelly

source activate evidencemodeler

parallel -j 40 < commands.list >& evm.log

conda deactivate

If the <parallel> command does not work properly, you can invoke the script by its location relative to the $PATH:

$EVM_HOME/EvmUtils/execute_EVM_commands.pl

Combining Fragments

Recombine the fragments for each contig:

#!/bin/bash                                                                                                             
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 144G-batch

# input
PART=partitions_list.out
PREV_OUT=evm.out
GENOME=masked_genome.fasta

source activate evidencemodeler

$EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl \
        --partitions $PART \
        --output_file_name $PREV_OUT

conda deactivate

Converting To GFF3

Convert contigs to gff3:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
#$ -q 144G-batch

# input
PART=partitions_list.out
PREV_OUT=evm.out
GENOME=masked_genome.fasta

source activate evidencemodeler

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl \                                                                     
        --partitions $PART \
        --output $PREV_OUT \
        --genome $GENOME

conda deactivate

Combining Contigs

Combine gff3 file from each contig to one file:

cat */evm.out.gff3 > EVM.all.gff3

cgeb2001's DokuWiki!

Table of Contents