Differences

This shows you the differences between two versions of the page.

--- gene_prediction_with_braker2_pipeline [2024/10/28 10:05] – 134.190.221.230
+++ gene_prediction_with_braker2_pipeline [2025/11/18 14:23] (current) – [Genome-guided transcriptome assembly] 134.190.191.148
@@ Line 1: / Line 1: @@
-====== Gene prediction with Braker2 ======
+====== Gene prediction with the Braker2 pipeline ======
 GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\
@@ Line 25: / Line 25: @@
 ===== Repeat masking =====
+From the BRAKER1 paper:
+"Repetitive sequences create challenges for automatic gene finders both at parameter estimation step and gene prediction step. The size and quality of the training set generated by GeneMark-ET for AUGUSTUS (multi-exon genes with so called anchored introns, the introns predicted //ab initio// and also supported by RNA-Seq read mapping) is not significantly affected by TEs masking since TEs have not anchored introns. However, at the prediction step TEs can corrupt gene prediction. For this reason, soft masking of genomic sequence is recommended before execution of BRAKER1."
 Some repetitive elements in your genome may by chance look like ORFs or even protein-coding genes. The main purpose of masking these repeats is to keep your gene predictor from treating these regions as candidate genes, so that it does not predict false-positive genes there.
@@ Line 126: / Line 130: @@
 #$ -cwd
 #$ -pe threaded 10
 cd $PWD
 source activate trinity-2.11-with-workaround
-Trinity --CPU 10 --max_memory 100G --genome_guided_bam yourgenome.fasta.sambamsorted.bam --genome_guided_max_intron 1000 --SS_lib_type RF
+Trinity \
+    --CPU 10 \
+    --max_memory 100G \
+    --genome_guided_bam yourgenome.fasta.sambamsorted.bam \
+    --genome_guided_max_intron 1000 \
+    --SS_lib_type RF
 conda deactivate
 </code>
@@ Line 143: / Line 156: @@
 ===== Braker2 =====
-[[https://github.com/Gaius-Augustus/BRAKER|Braker]] is a fully automated pipeline that makes use of the ab initio gene predictor GeneMark, RNAseq data mapping, and using the data of those two to train the machine learning algorithm of AUGUSTUS, which then promptly does a final round of gene prediction. Or something like that..
+[[https://github.com/Gaius-Augustus/BRAKER|Braker]] is a fully automated pipeline in which
+  - Intron start and end coordinates (//intron hints//) are extracted from the RNAseq BAM file
+  - These are then used along with the genome FASTA file to train GeneMark-ET
+  - The trained GeneMark-ET performs an "//ab initio//" gene prediction
+  - Those predicted gene structures for which all introns are supported by the RNAseq data (//anchored introns//) are selected to train AUGUSTUS
+  - The trained AUGUSTUS then predicts gene structures, again using the intron hints as "extrinsic evidence"
+{{::braker1_pipeline.png|}}
+The intron hints are extracted using the ''bam2hints'' tool with the ''--intronsonly'' flag; the tool comes with the AUGUSTUS and BRAKER tools.
+If you only use RNAseq as extrinsic evidence, you can essentially only use //donor splice site// and //acceptor splice site// hints. If you also have protein homology information, you can additionally infer and use //start//, //stop//, //exonpart// and //exon// hints (Stanke et al. 2006).
+The intron hints contain explicit location information and influence the optimal path through the GHMM machine (AUGUSTUS+ paper, Stanke et al. 2006). Note that because this is a probabilistic model, **hints can sometimes be ignored if the intrinsic information is strong enough!**
 Predict genes using GeneMark-ET and AUGUSTUS through Braker2: