gene_prediction_framework
Differences
This shows you the differences between two versions of the page.
| gene_prediction_framework [2022/12/06 10:41] – [Genome-guided transcriptome assembly] 134.190.232.140 | gene_prediction_framework [2022/12/13 15:49] (current) – jason | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | **This page is now deprecated; see** [[gene_prediction_with_braker2_pipeline|Gene prediction with Braker2 pipeline]] | ||
| + | |||
| + | ===== Gene prediction with Braker2 pipeline ===== | ||
| + | |||
| GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\ | GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\ | ||
| Updated by Joran Martijn in July 2022 | Updated by Joran Martijn in July 2022 | ||
| Line 5: | Line 9: | ||
| Think about whether you should mask your genome. Masking hides repeat regions such as transposons. Protist genomes tend to be small and generally have fewer repeat regions than plant and metazoan genomes. In addition, regions of very high or very low GC content can be erroneously masked. So, before you mask, take some time to consider your genome and its characteristics. | Think about whether you should mask your genome. Masking hides repeat regions such as transposons. Protist genomes tend to be small and generally have fewer repeat regions than plant and metazoan genomes. In addition, regions of very high or very low GC content can be erroneously masked. So, before you mask, take some time to consider your genome and its characteristics. | ||
| + | |||
| + | Example scripts for submitting these steps to Perun are available on the [[https:// | ||
| The GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow make sure you have an up-to-date GeneMark-ET licence key stored as a hidden file in your **home** directory: | The GeneMark-ET licence expires often (usually within 3 months), so before you run the workflow make sure you have an up-to-date GeneMark-ET licence key stored as a hidden file in your **home** directory: | ||
| Line 20: | Line 26: | ||
| mv gm_key_64 .gm_key | mv gm_key_64 .gm_key | ||
| - | ===== Repeat masking | + | ==== Repeat masking ==== |
| Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly. | Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly. | ||
| Line 67: | Line 73: | ||
| </ | </ | ||
| - | ===== RNAseq mapping | + | ==== RNAseq mapping ==== |
| Line 93: | Line 99: | ||
| </ | </ | ||
| | | ||
| - | ===== Genome-guided transcriptome assembly | + | ==== Genome-guided transcriptome assembly ==== |
| Line 117: | Line 123: | ||
| </ | </ | ||
| - | ===== Predicting gene models with PASA ===== | ||
| - | + | ==== Braker2 ==== | |
| - | ===== Gene prediction with Braker2 | + | |
| Line 130: | Line 134: | ||
| #$ -cwd | #$ -cwd | ||
| #$ -pe threaded 20 | #$ -pe threaded 20 | ||
| + | |||
| cd $PWD | cd $PWD | ||
| + | |||
| source activate braker-preq | source activate braker-preq | ||
| + | |||
| export VALIDATOR=/ | export VALIDATOR=/ | ||
| export AUGUSTUS_BIN_PATH=/ | export AUGUSTUS_BIN_PATH=/ | ||
| Line 138: | Line 145: | ||
| export PATH=/ | export PATH=/ | ||
| export GENEMARK_PATH=/ | export GENEMARK_PATH=/ | ||
| - | braker.pl --species=your_species --bam=yourmasked_genome.sambamsorted.bam --genome=$original_genome --softmasking --cores=20 | + | |
| + | braker.pl | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| </ | </ | ||
| + | |||
| + | |||
| + | ==== Predicting gene models with PASA ==== | ||
| + | |||
| + | PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome. | ||
| + | |||
| + | You need to specify a PASA config file. Below is an example: | ||
| + | |||
| + | < | ||
| + | ## templated variables to be replaced exist as < | ||
| + | |||
| + | # database settings | ||
| + | ## pasa will create an sqlite database in the location desired below | ||
| + | DATABASE=/ | ||
| + | |||
| + | ####################################################### | ||
| + | # Parameters to specify to specific scripts in pipeline | ||
| + | # create a key = " | ||
| + | # assign a value as done above. | ||
| + | |||
| + | #script validate_alignments_in_db.dbi | ||
| + | validate_alignments_in_db.dbi: | ||
| + | validate_alignments_in_db.dbi: | ||
| + | validate_alignments_in_db.dbi: | ||
| + | |||
| + | #script subcluster_builder.dbi | ||
| + | subcluster_builder.dbi: | ||
| + | </ | ||
| + | |||
| + | < | ||
| + | # | ||
| + | #$ -S /bin/bash | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 20 | ||
| + | |||
| + | # input | ||
| + | CONFIG=' | ||
| + | GENOME=' | ||
| + | TRANSCRIPTOME=' | ||
| + | THREADS=20 | ||
| + | |||
| + | source activate pasa-2.5.2 | ||
| + | |||
| + | # run pasa | ||
| + | Launch_PASA_pipeline.pl \ | ||
| + | --create --run \ | ||
| + | -c $CONFIG \ | ||
| + | -g $GENOME \ | ||
| + | -t $TRANSCRIPTOME \ | ||
| + | --transcribed_is_aligned_orient \ | ||
| + | --ALIGNERS blat, | ||
| + | --CPU $THREADS | ||
| + | |||
| + | conda deactivate | ||
| + | </ | ||
| + | |||
| + | ==== Compiling the final gene models with EvidenceModeler ==== | ||
| + | |||
gene_prediction_framework.1670337694.txt.gz · Last modified: by 134.190.232.140
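
The page notes that the GeneMark-ET licence usually expires within about 3 months and that the key must sit in your home directory as `.gm_key`. A minimal pre-flight sketch for that check is shown below. The function names `file_age_days` and `check_gm_key`, and the default 90-day threshold, are assumptions made for illustration and are not part of the original workflow. The check uses the file's modification time as a rough proxy for the licence age, which may not match the real expiry date, and it relies on GNU `stat` (as found on most Linux clusters).

```shell
#!/bin/bash
# Hypothetical helper functions; names and the 90-day default are assumptions.

# Print the age of a file in whole days, based on its modification time.
file_age_days() {
    local file="$1"
    local now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat: modification time in epoch seconds
    echo $(( (now - mtime) / 86400 ))
}

# Return 0 if the key file exists, is non-empty, and is younger than max_days.
# Return 1 (with a message on stderr) otherwise.
check_gm_key() {
    local key="${1:-$HOME/.gm_key}"
    local max_days="${2:-90}"
    local age

    if [ ! -s "$key" ]; then
        echo "Missing or empty GeneMark key: $key" >&2
        return 1
    fi

    age=$(file_age_days "$key")
    if [ "$age" -ge "$max_days" ]; then
        echo "GeneMark key $key is $age days old; consider downloading a fresh one" >&2
        return 1
    fi

    echo "GeneMark key $key looks current ($age days old)"
}
```

One way to use this would be to source the functions at the top of the submission script and call `check_gm_key || exit 1` before `braker.pl`, so that a stale key stops the job early instead of failing partway through the run.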
