This is an old revision of the document!
Table of Contents
This page is now deprecated, check outGene prediction with Braker2 pipeline
Gene prediction with Braker2 pipeline
GP using machine learning and extrinsic hints by DE Salas-Leiva (last updated Oct-21-2020)
Updated by Joran Martijn in July 2022
Multiple steps need to be carried out for Gene Prediction:
Think about whether you should mask your genome. Masking is used to hide repeat regions, things like transposons. Protist genomes tend to be on the small side and generally don't have as many repeat regions as plant and metazoan genomes. As well, very high and very low GC content can be erroneously masked out. So, before you mask take some time to think about your genome and its characteristics.
Example scripts for submitting these steps to Perun are available on the Roger Lab and ICG Github pages.
GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow make sure you have an up to date licence for GeneMark-ET hidden in your home directory: id your key is older than 3 months or you are not sure about it, you are better off by downloading a new licence as follows:
go to http://topaz.gatech.edu/GeneMark/license_download.cgi and download the 64 bits version: GeneMark-ES / ET (version) and move it to perun:
then do the following commands in your home directory in perun:
gunzip gm_key_64.gz
and finally, change the file name and hide it:
mv gm_key_64 .gm_key
Repeat masking
Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly.
NOTE: The -xsmall option of RepeatMasker will ensure that your output is soft-masked. This means that the repetitive regions will be presented in lower case in the output FASTA file. This is in contrast with hard-masking, which will replace repetitive sequences with Ns.
NOTE: The masked assembly is necessary for Braker, not for Hisat2
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20
cd $PWD
original_genome=CARP_genome_july16.fasta
root=CARP_genome_july16
nthreads=20
RMdir=RepeatMasked_genome/
source activate repeatmasker+modeller-no-update
export PATH=/misc/scratch2/software/anaconda/envs/repeatmasker+modeller-no-update/custom/RepeatModeler-open-1.0.11:$PATH
mkdir RepeatMasked_genome
echo 'Modeling and Masking genome'
BuildDatabase -name $root.index $original_genome
sleep 30s
RepeatModeler -pa $nthreads -database $root.index > $root.run.out
RepeatMasker \
-pa $nthreads \
-xsmall \
-s \
-nolow \
-dir $RMdir \
-a \
-inv \
-lib $root.index-families.fa \
$original_genome
source deactivate
RNAseq mapping
On the masked assembly map the RNAseq using Hisat2, sort the output and create a depth file:
NOTE: Since our masked assembly is soft-masked, RNAseq reads originating from repetitive regions will still map here. Only if repetitive regions are hard-masked, will Hisat2 ignore them.
NOTE: Only paired reads are mapped here! If you also try to map unpaired reads, Trinity will complain.
#!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 10 cd $PWD source activate hisat2 hisat2-build -f your_masked_genome your_genome_index hisat2 -q -p 10 --phred33 --max-intronlen 1000 -k 2 -x yourgenome_index -1 READS_R1 -2 READS_R2 --rna-strandness RF -S your_masked_genome.sam source deactivate samtools view -bS -o your_genome.sambam your_masked_genome.sam samtools sort your masked_genome.sambam -o yourmasked_genome.sambamsorted.bam samtools index your_masked_genome.sambamsorted.bam
Genome-guided transcriptome assembly
Do this step only if you are going to use PASA for prediction validation, otherwise you must skip it: Assembly the RNAseq in genome guided mode with Trinity. This shell is for Strand Specific RNAseq
#!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 10 cd $PWD source activate trinity-2.11-with-workaround Trinity --CPU 10 --max_memory 100G --genome_guided_bam yourgenome.fasta.sambamsorted.bam --genome_guided_max_intron 1000 --SS_lib_type RF conda deactivate
To avoid multiple isoforms that could affect PASA performance you could create a transcript file containing only supertranscripts:
supertranscripts=/misc/scratch2/software/trinityrnaseq-Trinity-v2.6.6/Analysis/SuperTranscripts source activate trinity $supertranscripts/Trinity_gene_splice_modeler.py --trinity_fasta Trinity-GG.fasta conda deactivate
Braker2
Predict genes using Genemark-ET and Augustus through braker2:
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20
cd $PWD
source activate braker-preq
export VALIDATOR=/opt/perun/PASApipeline-2.0.2/misc_utilities/
export AUGUSTUS_BIN_PATH=/scratch2/software/braker/Augustus/bin/
export AUGUSTUS_CONFIG_PATH=/home/dsalas/Augustus-master/config/
export AUGUSTUS_SCRIPTS_PATH=/scratch2/software/braker/Augustus/scripts/
export PATH=/scratch2/software/braker/Augustus/bin:/scratch2/software/braker/Augustus/scripts:/scratch2/software/braker/BRAKER/scripts:$PATH
export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/
braker.pl \
--species=your_species \
--bam=yourmasked_genome.sambamsorted.bam \
--genome=$original_genome \
--softmasking --cores=20
Predicting gene models with PASA
PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome.
You need to specify your pasa config file. Below an example:
## templated variables to be replaced exist as <__var_name__> # database settings ## pasa will create an sqlite database in the location desired below DATABASE=/scratch3/jmartijn/ergo-genome/results/24_pasa-2.5.2/ergobibamus.sqlite ####################################################### # Parameters to specify to specific scripts in pipeline # create a key = "script_name" + ":" + "parameter" # assign a value as done above. #script validate_alignments_in_db.dbi validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=75 validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95 validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=0 #script subcluster_builder.dbi subcluster_builder.dbi:-m=50
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20
# input
CONFIG='pasa.config'
GENOME='ergo_cyp_genome.fasta.masked'
TRANSCRIPTOME='Trinity-GG.fasta'
THREADS=20
source activate pasa-2.5.2
# run pasa
Launch_PASA_pipeline.pl \
--create --run \
-c $CONFIG \
-g $GENOME \
-t $TRANSCRIPTOME \
--transcribed_is_aligned_orient \
--ALIGNERS blat,gmap,minimap2 \
--CPU $THREADS
conda deactivate
