gene_prediction_framework

Differences

This shows you the differences between two versions of the page.

gene_prediction_framework [2022/12/06 10:41] – [Genome-guided transcriptome assembly] 134.190.232.140
gene_prediction_framework [2022/12/13 15:49] (current) jason
Line 1: Line 1:
 +**This page is now deprecated, check out** [[gene_prediction_with_braker2_pipeline|Gene prediction with Braker2 pipeline]]
 +
 +===== Gene prediction with Braker2 pipeline =====
 +
 GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\
 Updated by Joran Martijn in July 2022
Line 5: Line 9:
  
 Think about whether you should mask your genome. Masking hides repeat regions such as transposons. Protist genomes tend to be on the small side and generally don't have as many repeat regions as plant and metazoan genomes. In addition, regions with very high or very low GC content can be erroneously masked. So, before you mask, take some time to think about your genome and its characteristics.
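To get a quick feel for those characteristics, a minimal sketch like the one below reports total length, GC content, and the fraction of lowercase (soft-masked) bases in a FASTA file. The function name ''fasta_gc_summary'' is hypothetical, and the sketch assumes that lowercase bases mark soft-masked sequence, which is the usual convention but worth confirming for your assembly.

```shell
# Hypothetical helper: summarise length, GC content and soft-masked fraction.
# Assumes lowercase bases (a, c, g, t, n) denote soft-masked sequence.
fasta_gc_summary() {
    LC_ALL=C awk '
        /^>/ { next }                                # skip header lines
        {
            total += length($0)
            seq = $0
            gc += gsub(/[GCgc]/, "", seq)            # count G and C in either case
            seq = $0
            soft += gsub(/[acgtn]/, "", seq)         # count lowercase bases
        }
        END {
            if (total == 0) { print "no sequence found"; exit 1 }
            printf "length=%d gc=%.1f%% softmasked=%.1f%%\n",
                   total, 100 * gc / total, 100 * soft / total
        }
    ' "$1"
}
```

Running it on the assembly before and after masking, for example ''fasta_gc_summary my_genome.fasta'' (the file name is a placeholder), shows how much sequence was masked.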
 +
 +Example scripts for submitting these steps to Perun are available on the [[https://github.com/RogerLab/gospel_of_andrew|Roger Lab]] and [[https://github.com/Dalhousie-ICG/icg-shared-scripts|ICG]] GitHub pages.
  
 The GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow make sure you have an up-to-date licence key for GeneMark-ET stored as a hidden file in your **home** directory:
Line 20: Line 26:
    mv gm_key_64 .gm_key
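
Because the key expires so often, it can help to check it before submitting a job. The sketch below is only a rough heuristic: the function name ''gm_key_status'' and the 60-day threshold are assumptions, and it uses the file's modification time as a stand-in for the real expiry date, which it cannot read.

```shell
# Hypothetical helper: report whether a GeneMark key file is missing, stale or ok.
# "stale" only means the file is older than max_days; it is not a real expiry check.
gm_key_status() {
    local key="$1" max_days="$2"
    if [ ! -s "$key" ]; then
        echo "missing"
    elif [ -n "$(find "$key" -mtime +"$max_days" -print)" ]; then
        echo "stale"
    else
        echo "ok"
    fi
}
```

For example, ''gm_key_status "$HOME/.gm_key" 60'' prints ''missing'', ''stale'' or ''ok''.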
  
-===== Repeat masking =====
 +==== Repeat masking ====
  
 Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly.
Line 67: Line 73:
 </code>
  
-===== RNAseq mapping =====
 +==== RNAseq mapping ====
  
  
Line 93: Line 99:
 </code>
          
-===== Genome-guided transcriptome assembly =====
 +==== Genome-guided transcriptome assembly ====
  
  
Line 117: Line 123:
 </code>
  
-===== Predicting gene models with PASA ===== 
  
-
-===== Gene prediction with Braker2 =====
 +==== Braker2 ====
 +
  
  
Line 130: Line 134:
 #$ -cwd
 #$ -pe threaded 20
 +
 cd $PWD
 +
 source activate braker-preq
 +
 export VALIDATOR=/opt/perun/PASApipeline-2.0.2/misc_utilities/
 export AUGUSTUS_BIN_PATH=/scratch2/software/braker/Augustus/bin/
Line 138: Line 145:
 export PATH=/scratch2/software/braker/Augustus/bin:/scratch2/software/braker/Augustus/scripts:/scratch2/software/braker/BRAKER/scripts:$PATH
 export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/
-braker.pl --species=your_species --bam=yourmasked_genome.sambamsorted.bam --genome=$original_genome --softmasking --cores=20
 +
 +braker.pl \
 +    --species=your_species \
 +    --bam=yourmasked_genome.sambamsorted.bam \
 +    --genome=$original_genome \
 +    --softmasking --cores=20
 </code>
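
The script above relies on several exported paths, and a stale path tends to surface only after the job has been queued. As a minimal sketch, the hypothetical helper ''check_env_dirs'' below verifies that each named variable is set and points to an existing directory. It assumes the variable names come from your own script, since they are evaluated by name.

```shell
# Hypothetical helper: check that each named environment variable is set
# and refers to an existing directory. Returns non-zero if any check fails.
check_env_dirs() {
    local name value status=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value" >&2
            status=1
        fi
    done
    return "$status"
}
```

Placed after the ''export'' lines, a call such as ''check_env_dirs AUGUSTUS_BIN_PATH GENEMARK_PATH || exit 1'' would stop the job before braker.pl starts.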
 +
 +
 +==== Predicting gene models with PASA ====
 +
 +PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome.
 +
 +You need to provide a PASA config file. An example is shown below:
 +
 +<code>
 +## templated variables to be replaced exist as <__var_name__>
 +
 +# database settings
 +## pasa will create an sqlite database in the location desired below
 +DATABASE=/scratch3/jmartijn/ergo-genome/results/24_pasa-2.5.2/ergobibamus.sqlite
 +
 +#######################################################
 +# Parameters to specify to specific scripts in pipeline
 +# create a key = "script_name" + ":" + "parameter"
 +# assign a value as done above.
 +
 +#script validate_alignments_in_db.dbi
 +validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=75
 +validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95
 +validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=0
 +
 +#script subcluster_builder.dbi
 +subcluster_builder.dbi:-m=50
 +</code>
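
Since PASA creates the sqlite database at the ''DATABASE'' location from the config, it may be worth confirming that the target directory exists before launching. The sketch below is one way to do that. The function names ''pasa_db_path'' and ''check_pasa_db_dir'' are hypothetical, and the sketch assumes the config has a single uncommented ''DATABASE='' line as in the example above.

```shell
# Hypothetical helpers: read the DATABASE= value from a PASA config file
# and confirm that its parent directory exists.
pasa_db_path() {
    sed -n 's/^DATABASE=//p' "$1" | head -n 1
}

check_pasa_db_dir() {
    local db
    db=$(pasa_db_path "$1")
    if [ -z "$db" ]; then
        echo "no DATABASE= line found in $1" >&2
        return 1
    fi
    if [ ! -d "$(dirname "$db")" ]; then
        echo "directory for $db does not exist" >&2
        return 1
    fi
    echo "$db"
}
```

For example, ''check_pasa_db_dir pasa.config'' prints the database path when the directory is in place and returns a non-zero status otherwise.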
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +#$ -cwd
 +#$ -pe threaded 20
 +
 +# input
 +CONFIG='pasa.config'
 +GENOME='ergo_cyp_genome.fasta.masked'
 +TRANSCRIPTOME='Trinity-GG.fasta'
 +THREADS=20
 +
 +source activate pasa-2.5.2
 +
 +# run pasa
 +Launch_PASA_pipeline.pl \
 +        --create --run \
 +        -c $CONFIG \
 +        -g $GENOME \
 +        -t $TRANSCRIPTOME \
 +        --transcribed_is_aligned_orient \
 +        --ALIGNERS blat,gmap,minimap2 \
 +        --CPU $THREADS
 +
 +conda deactivate
 +</code>
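
PASA aligns the assembled transcripts to the genome, so both FASTA inputs need to be present and non-empty. As a small sanity check before submitting, the hypothetical helper ''fasta_count'' below counts the records in a FASTA file. This is a sketch only and does not validate the sequences themselves.

```shell
# Hypothetical helper: print the number of records in a FASTA file.
# Fails if the file is missing or empty; grep also returns non-zero
# (and prints 0) when the file contains no header lines.
fasta_count() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
    grep -c '^>' "$1"
}
```

With the variables from the script above, ''fasta_count "$GENOME"'' and ''fasta_count "$TRANSCRIPTOME"'' would report the number of contigs and transcripts.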
 +
 +==== Compiling the final gene models with EvidenceModeler ====
 +
gene_prediction_framework.1670337694.txt.gz · Last modified: by 134.190.232.140