**This page is now deprecated, check out** [[gene_prediction_with_braker2_pipeline|Gene prediction with Braker2 pipeline]] ===== Gene prediction with Braker2 pipeline ===== GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\ Updated by Joran Martijn in July 2022 Multiple steps need to be carried out for Gene Prediction: Think about whether you should mask your genome. Masking is used to hide repeat regions, things like transposons. Protist genomes tend to be on the small side and generally don't have as many repeat regions as plant and metazoan genomes. As well, very high and very low GC content can be erroneously masked out. So, before you mask take some time to think about your genome and its characteristics. Example scripts for submitting these steps to Perun are available on the [[https://github.com/RogerLab/gospel_of_andrew|Roger Lab]] and [[https://github.com/Dalhousie-ICG/icg-shared-scripts|ICG]] Github pages. GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow make sure you have an up to date licence for GeneMark-ET hidden in your **home** directory: id your key is older than 3 months or you are not sure about it, you are better off by downloading a new licence as follows: go to http://topaz.gatech.edu/GeneMark/license_download.cgi and download the 64 bits version: GeneMark-ES / ET (version) and move it to perun: then do the following commands in your home directory in perun: gunzip gm_key_64.gz and finally, change the file name and hide it: mv gm_key_64 .gm_key ==== Repeat masking ==== Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly. NOTE: The **-xsmall** option of RepeatMasker will ensure that your output is soft-masked. This means that the repetitive regions will be presented in lower case in the output FASTA file. This is in contrast with hard-masking, which will replace repetitive sequences with Ns. NOTE: The masked assembly is necessary for Braker, not for Hisat2 #!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 20 cd $PWD original_genome=CARP_genome_july16.fasta root=CARP_genome_july16 nthreads=20 RMdir=RepeatMasked_genome/ source activate repeatmasker+modeller-no-update export PATH=/misc/scratch2/software/anaconda/envs/repeatmasker+modeller-no-update/custom/RepeatModeler-open-1.0.11:$PATH mkdir RepeatMasked_genome echo 'Modeling and Masking genome' BuildDatabase -name $root.index $original_genome sleep 30s RepeatModeler -pa $nthreads -database $root.index > $root.run.out RepeatMasker \ -pa $nthreads \ -xsmall \ -s \ -nolow \ -dir $RMdir \ -a \ -inv \ -lib $root.index-families.fa \ $original_genome source deactivate ==== RNAseq mapping ==== On the masked assembly map the RNAseq using Hisat2, sort the output and create a depth file: NOTE: Since our masked assembly is soft-masked, RNAseq reads originating from repetitive regions will still map here. Only if repetitive regions are hard-masked, will Hisat2 ignore them. NOTE: Only paired reads are mapped here! If you also try to map unpaired reads, Trinity will complain. #!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 10 cd $PWD source activate hisat2 hisat2-build -f your_masked_genome your_genome_index hisat2 -q -p 10 --phred33 --max-intronlen 1000 -k 2 -x yourgenome_index -1 READS_R1 -2 READS_R2 --rna-strandness RF -S your_masked_genome.sam source deactivate samtools view -bS -o your_genome.sambam your_masked_genome.sam samtools sort your masked_genome.sambam -o yourmasked_genome.sambamsorted.bam samtools index your_masked_genome.sambamsorted.bam ==== Genome-guided transcriptome assembly ==== Do this step only if you are going to use PASA for prediction validation, otherwise you must skip it: Assembly the RNAseq in genome guided mode with Trinity. This shell is for **Strand Specific RNAseq** #!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 10 cd $PWD source activate trinity-2.11-with-workaround Trinity --CPU 10 --max_memory 100G --genome_guided_bam yourgenome.fasta.sambamsorted.bam --genome_guided_max_intron 1000 --SS_lib_type RF conda deactivate To avoid multiple isoforms that could affect PASA performance you could create a transcript file containing only supertranscripts: supertranscripts=/misc/scratch2/software/trinityrnaseq-Trinity-v2.6.6/Analysis/SuperTranscripts source activate trinity $supertranscripts/Trinity_gene_splice_modeler.py --trinity_fasta Trinity-GG.fasta conda deactivate ==== Braker2 ==== Predict genes using Genemark-ET and Augustus through braker2: #!/bin/bash #$ -S /bin/bash . /etc/profile #$ -cwd #$ -pe threaded 20 cd $PWD source activate braker-preq export VALIDATOR=/opt/perun/PASApipeline-2.0.2/misc_utilities/ export AUGUSTUS_BIN_PATH=/scratch2/software/braker/Augustus/bin/ export AUGUSTUS_CONFIG_PATH=/home/dsalas/Augustus-master/config/ export AUGUSTUS_SCRIPTS_PATH=/scratch2/software/braker/Augustus/scripts/ export PATH=/scratch2/software/braker/Augustus/bin:/scratch2/software/braker/Augustus/scripts:/scratch2/software/braker/BRAKER/scripts:$PATH export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/ braker.pl \ --species=your_species \ --bam=yourmasked_genome.sambamsorted.bam \ --genome=$original_genome \ --softmasking --cores=20 ==== Predicting gene models with PASA ==== PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome. You need to specify your pasa config file. Below an example: ## templated variables to be replaced exist as <__var_name__> # database settings ## pasa will create an sqlite database in the location desired below DATABASE=/scratch3/jmartijn/ergo-genome/results/24_pasa-2.5.2/ergobibamus.sqlite ####################################################### # Parameters to specify to specific scripts in pipeline # create a key = "script_name" + ":" + "parameter" # assign a value as done above. #script validate_alignments_in_db.dbi validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=75 validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95 validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=0 #script subcluster_builder.dbi subcluster_builder.dbi:-m=50 #!/bin/bash #$ -S /bin/bash #$ -cwd #$ -pe threaded 20 # input CONFIG='pasa.config' GENOME='ergo_cyp_genome.fasta.masked' TRANSCRIPTOME='Trinity-GG.fasta' THREADS=20 source activate pasa-2.5.2 # run pasa Launch_PASA_pipeline.pl \ --create --run \ -c $CONFIG \ -g $GENOME \ -t $TRANSCRIPTOME \ --transcribed_is_aligned_orient \ --ALIGNERS blat,gmap,minimap2 \ --CPU $THREADS conda deactivate ==== Compiling the final gene models with EvidenceModeler ====