--- gene_prediction_framework [2022/12/06 10:41] – [Genome-guided transcriptome assembly] 134.190.232.140
+++ gene_prediction_framework [2022/12/13 15:49] (current) – jason
@@ Line 1: / Line 1: @@
+**This page is now deprecated; see** [[gene_prediction_with_braker2_pipeline|Gene prediction with Braker2 pipeline]]
+===== Gene prediction with Braker2 pipeline =====
 GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\
 Updated by Joran Martijn in July 2022
@@ Line 5: / Line 9: @@
 Think about whether you should mask your genome. Masking hides repeat regions such as transposons. Protist genomes tend to be on the small side and generally have fewer repeat regions than plant and metazoan genomes. In addition, regions with very high or very low GC content can be erroneously masked out. So, before you mask, take some time to think about your genome and its characteristics.
+Example scripts for submitting these steps to Perun are available on the [[https://github.com/RogerLab/gospel_of_andrew|Roger Lab]] and [[https://github.com/Dalhousie-ICG/icg-shared-scripts|ICG]] GitHub pages.
 The GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow, make sure you have an up-to-date GeneMark-ET licence key stored as a hidden file in your **home** directory:
@@ Line 20: / Line 26: @@
    mv gm_key_64 .gm_key
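Because the key expires so often, it can help to check it before submitting a job. The following is only a minimal sketch: the function name ''check_gm_key'' is hypothetical, the 90-day threshold is an assumption taken from the "usually within 3 months" remark above, and it assumes the file's modification time roughly matches the date the key was downloaded. It does not read any real expiry date from the key itself.

```shell
#!/bin/bash
# Hypothetical helper: report whether a GeneMark key file looks usable.
# Assumption: the file's modification time approximates its download date.
check_gm_key() {
    local key_file="$1"
    local max_age_days="${2:-90}"   # assumed threshold, not an official value

    # -s is true only if the file exists and is not empty
    if [ ! -s "$key_file" ]; then
        echo "missing"
        return 1
    fi

    # find prints the path only if the file was last modified
    # more than max_age_days days ago
    if [ -n "$(find "$key_file" -mtime +"$max_age_days" 2>/dev/null)" ]; then
        echo "stale"
        return 2
    fi

    echo "ok"
}

# Example call against the location used on this page
check_gm_key "$HOME/.gm_key" || echo "Download a fresh GeneMark-ET key before running the workflow" >&2
```

A "stale" result only means the file is old; the licence may still be valid, so treat it as a reminder rather than a hard failure.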
-===== Repeat masking =====
+==== Repeat masking ====
 Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly.
@@ Line 67: / Line 73: @@
 </code>
-===== RNAseq mapping =====
+==== RNAseq mapping ====
@@ Line 93: / Line 99: @@
 </code>
-===== Genome-guided transcriptome assembly =====
+==== Genome-guided transcriptome assembly ====
@@ Line 117: / Line 123: @@
 </code>
-===== Predicting gene models with PASA =====
+==== Braker2 ====
-===== Gene prediction with Braker2 =====
@@ Line 130: / Line 134: @@
 #$ -cwd
 #$ -pe threaded 20
 cd $PWD
 source activate braker-preq
 export VALIDATOR=/opt/perun/PASApipeline-2.0.2/misc_utilities/
 export AUGUSTUS_BIN_PATH=/scratch2/software/braker/Augustus/bin/
@@ Line 138: / Line 145: @@
 export PATH=/scratch2/software/braker/Augustus/bin:/scratch2/software/braker/Augustus/scripts:/scratch2/software/braker/BRAKER/scripts:$PATH
 export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/
-braker.pl --species=your_species --bam=yourmasked_genome.sambamsorted.bam --genome=$original_genome --softmasking --cores=20
+braker.pl \
+    --species=your_species \
+    --bam=yourmasked_genome.sambamsorted.bam \
+    --genome=$original_genome \
+    --softmasking --cores=20
 </code>
+==== Predicting gene models with PASA ====
+PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome.
+You need to specify your PASA config file. Below is an example:
+<code>
+## templated variables to be replaced exist as <__var_name__>
+# database settings
+## pasa will create an sqlite database in the location desired below
+DATABASE=/scratch3/jmartijn/ergo-genome/results/24_pasa-2.5.2/ergobibamus.sqlite
+#######################################################
+# Parameters to specify to specific scripts in pipeline
+# create a key = "script_name" + ":" + "parameter"
+# assign a value as done above.
+#script validate_alignments_in_db.dbi
+validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=75
+validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95
+validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=0
+#script subcluster_builder.dbi
+subcluster_builder.dbi:-m=50
+</code>
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+#$ -cwd
+#$ -pe threaded 20
+# input
+CONFIG='pasa.config'
+GENOME='ergo_cyp_genome.fasta.masked'
+TRANSCRIPTOME='Trinity-GG.fasta'
+THREADS=20
+source activate pasa-2.5.2
+# run pasa
+Launch_PASA_pipeline.pl \
+        --create --run \
+        -c $CONFIG \
+        -g $GENOME \
+        -t $TRANSCRIPTOME \
+        --transcribed_is_aligned_orient \
+        --ALIGNERS blat,gmap,minimap2 \
+        --CPU $THREADS
+conda deactivate
+</code>
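Before submitting the PASA job above, it may be worth confirming that the inputs exist and that the config points the database where you expect. The sketch below is an illustration only: the function names ''pasa_db_path'' and ''check_pasa_inputs'' are hypothetical and are not part of PASA. It assumes the config contains a single ''DATABASE='' line as in the example config.

```shell
#!/bin/bash
# Hypothetical pre-flight helpers for the PASA submission script.

# Print the value of the first DATABASE= line in a PASA config file.
pasa_db_path() {
    grep -m1 '^DATABASE=' "$1" | cut -d= -f2-
}

# Return non-zero if any of the given files is missing or empty.
check_pasa_inputs() {
    local f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            return 1
        fi
    done
}
```

In the submission script, these could be called right after the input variables are set, for example ''check_pasa_inputs "$CONFIG" "$GENOME" "$TRANSCRIPTOME" || exit 1'', followed by ''echo "PASA database: $(pasa_db_path "$CONFIG")"'' so that the database location ends up in the job log.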
+==== Compiling the final gene models with EvidenceModeler ====