gene_prediction_with_braker2_pipeline
Differences
This shows you the differences between two versions of the page.
| gene_prediction_with_braker2_pipeline [2023/02/24 18:02] – 134.190.247.138 | gene_prediction_with_braker2_pipeline [2025/11/18 14:23] (current) – [Genome-guided transcriptome assembly] 134.190.191.148 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ===== Gene prediction with Braker2 pipeline ===== | + | ====== Gene prediction with the Braker2 pipeline ====== |
| GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\ | GP using machine learning and extrinsic hints by **DE Salas-Leiva** (last updated Oct-21-2020)\\ | ||
| Line 24: | Line 24: | ||
| mv gm_key_64 .gm_key | mv gm_key_64 .gm_key | ||
| - | ==== Repeat masking ==== | + | ===== Repeat masking ===== |
| + | |||
| + | From the BRAKER1 paper: | ||
| + | |||
| + | " | ||
| + | |||
| + | Some repetitive elements in your genome may by chance resemble ORFs or even protein-coding genes. The main purpose of masking these repeats is to keep your gene predictor from examining these regions at all, so that it does not predict false-positive genes there. | ||
| Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly. | Mask the repetitive regions in your assembly using the following shell script. BuildDatabase and RepeatModeler will create a species-specific library of repeats from your genome, and then RepeatMasker will use that library to mask repetitive regions in your assembly. | ||
| Line 71: | Line 77: | ||
| </ | </ | ||
| - | ==== RNAseq mapping ==== | + | ===== RNAseq mapping ===== |
| + | RNAseq data is direct evidence of which regions of your genome are expressed. Mapping your RNAseq reads to the genome with a splice-aware mapper such as Hisat2 yields information on the starts and stops of protein-coding genes and, perhaps most importantly, on the start and stop coordinates of introns. | ||
| On the masked assembly, map the RNAseq reads using Hisat2, then sort the output and create a depth file: | On the masked assembly, map the RNAseq reads using Hisat2, then sort the output and create a depth file: | ||
| Line 86: | Line 93: | ||
| #$ -cwd | #$ -cwd | ||
| #$ -pe threaded 10 | #$ -pe threaded 10 | ||
| + | |||
| cd $PWD | cd $PWD | ||
| source activate hisat2 | source activate hisat2 | ||
| + | |||
| hisat2-build -f your_masked_genome your_genome_index | hisat2-build -f your_masked_genome your_genome_index | ||
| - | hisat2 -q -p 10 --phred33 --max-intronlen 1000 -k 2 -x yourgenome_index -1 READS_R1 -2 READS_R2 --rna-strandness RF -S your_masked_genome.sam | + | hisat2 |
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| source deactivate | source deactivate | ||
| samtools view -bS -o your_masked_genome.sambam your_masked_genome.sam | samtools view -bS -o your_masked_genome.sambam your_masked_genome.sam | ||
| - | samtools sort your masked_genome.sambam -o yourmasked_genome.sambamsorted.bam | + | samtools sort your_masked_genome.sambam -o your_masked_genome.sambamsorted.bam |
| samtools index your_masked_genome.sambamsorted.bam | samtools index your_masked_genome.sambamsorted.bam | ||
| </ | </ | ||
| | | ||
| - | ==== Genome-guided transcriptome assembly ==== | + | NOTE: The '' |
| + | |||
| + | |||
| + | ===== Genome-guided transcriptome assembly ===== | ||
| Line 107: | Line 130: | ||
| #$ -cwd | #$ -cwd | ||
| #$ -pe threaded 10 | #$ -pe threaded 10 | ||
| + | |||
| cd $PWD | cd $PWD | ||
| + | |||
| source activate trinity-2.11-with-workaround | source activate trinity-2.11-with-workaround | ||
| - | Trinity --CPU 10 --max_memory 100G --genome_guided_bam yourgenome.fasta.sambamsorted.bam --genome_guided_max_intron 1000 --SS_lib_type RF | + | |
| + | Trinity | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| conda deactivate | conda deactivate | ||
| </ | </ | ||
| Line 122: | Line 154: | ||
| - | ==== Braker2 ==== | + | ===== Braker2 ===== |
| + | [[https:// | ||
| + | |||
| + | - Intron start and end coordinates (//intron hints//) are extracted from the RNAseq BAM file | ||
| + | - These are then used along with the genome FASTA file to train GeneMark-ET | ||
| + | - The trained GeneMark-ET performs an "//ab initio//" | ||
| + | - Those predicted gene structures for which all introns are supported by the RNAseq data (//anchored introns//) are selected to train AUGUSTUS | ||
| + | - The trained AUGUSTUS now predicts gene structures, again using the intron hints as " | ||
| + | |||
| + | {{:: | ||
| + | |||
| + | The intron hints are extracted using the '' | ||
| + | |||
| + | If you only use RNAseq as extrinsic evidence, you essentially can only use //donor splice site// and //acceptor splice site// hints. If you also have protein homology information, | ||
| + | |||
| + | The intron hints contain explicit location information and influence the optimal path through the GHMM (AUGUSTUS+ paper, Stanke et al. 2006). It is important to note that since this is a probabilistic model, **hints can sometimes be ignored if the intrinsic information is strong enough!** | ||
| Predict genes using GeneMark-ET and Augustus through braker2: | Predict genes using GeneMark-ET and Augustus through braker2: | ||
| Line 152: | Line 199: | ||
| - | ==== Predicting gene models with PASA ==== | + | ===== Predicting gene models with PASA ===== |
| PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome. | PASA will use the genome-guided transcriptome assembly to estimate where gene models are located. It does this by aligning the assembled transcripts to the reference genome. | ||
| Line 206: | Line 253: | ||
| </ | </ | ||
| - | ==== Combining gene models with EVidenceModeler(EVM)==== | + | ===== Combining gene models with EVidenceModeler (EVM) (work in progress) ===== |
| - | ==NOTE: EVM requires genome sequences in Fasta format and all gene structures and alignment evidences described in GFF3 format.== | + | NOTE: EVM requires genome sequences in FASTA format and all gene structures and alignment evidence described in GFF3 format. |
| - | (1) Check the integrity of the gene models you wish to combine via EVM's native validator. | + | ==== Validation ==== |
| + | |||
| + | Check the integrity of the gene models you wish to combine via EVM's native validator. | ||
| < | < | ||
| Line 235: | Line 284: | ||
| </ | </ | ||
| - | (2) EVM requires partitioning of the genome assembly for parallelization, | + | ==== Partition ==== |
| + | |||
| + | EVM requires partitioning of the genome assembly for parallelization, | ||
| This step generates directories containing the fragments of each gene model for each contig, as well as a comprehensive list of these fragments and their locations on the file system. | This step generates directories containing the fragments of each gene model for each contig, as well as a comprehensive list of these fragments and their locations on the file system. | ||
| Line 267: | Line 318: | ||
| </ | </ | ||
| - | (3) Since gene models can contain conflicting predictions, | + | ==== Assigning Weights ==== |
| + | |||
| + | Since gene models can contain conflicting predictions, | ||
| The weights file has three columns: class, type and weight. The class string is limited to the options shown below, but the other columns can take arbitrary values of the appropriate type. Weights are listed in ascending order. Below is an example: | The weights file has three columns: class, type and weight. The class string is limited to the options shown below, but the other columns can take arbitrary values of the appropriate type. Weights are listed in ascending order. Below is an example: | ||
| Line 282: | Line 335: | ||
| </ | </ | ||
| - | (4) With all the requiste input files ready, a series of commands need to be generated for running parallelly. | + | ==== Generating Commands ==== |
| + | |||
| + | With all the requisite input files ready, a series of commands needs to be generated so that they can be run in parallel. | ||
| The file < | The file < | ||
| Line 317: | Line 372: | ||
| </ | </ | ||
| - | (5) Run the series of commands in parallel with the below script: | + | ==== Running Commands ==== |
| + | |||
| + | Run the series of commands in parallel with the script below: | ||
| < | < | ||
| Line 341: | Line 398: | ||
| </ | </ | ||
| - | (6) Recombine the fragments for each contig: | + | ==== Combining Fragments ==== |
| + | |||
| + | Recombine the fragments for each contig: | ||
| < | < | ||
| Line 364: | Line 423: | ||
| </ | </ | ||
| - | (7) Convert contigs to gff3: | + | ==== Converting To GFF3 ==== |
| + | |||
| + | Convert contigs to gff3: | ||
| < | < | ||
| Line 388: | Line 449: | ||
| </ | </ | ||
| - | (8) Combine gff3 file from each contig to one file: | + | ==== Combining Contigs ==== |
| + | |||
| + | Combine the GFF3 files from each contig into one file: | ||
| < | < | ||
| cat */ | cat */ | ||
| </ | </ | ||
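
As a companion to the Assigning Weights step, here is a minimal shell sketch that sanity-checks a three-column weights file (class, type, weight) before it is handed to EVM. It assumes the four evidence class names given in the EVM documentation (ABINITIO_PREDICTION, OTHER_PREDICTION, PROTEIN, TRANSCRIPT). The function name ''check_weights'', the type labels and the weights in the sample file are illustrative and are not taken from this page.

```shell
#!/usr/bin/env bash
# Sketch only: validate an EVM-style weights file (class, type, weight).
# Assumption: the allowed class names are those listed in the EVM docs.
# The type labels and weights in the sample below are made up for illustration.

check_weights() {
    # Prints one message per problem; returns non-zero if any row is invalid.
    awk '
        BEGIN {
            ok["ABINITIO_PREDICTION"] = 1
            ok["OTHER_PREDICTION"]    = 1
            ok["PROTEIN"]             = 1
            ok["TRANSCRIPT"]          = 1
        }
        /^[[:space:]]*$/ { next }   # skip blank lines
        NF != 3 {
            printf "line %d: expected 3 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        !($1 in ok) {
            printf "line %d: unknown class %s\n", NR, $1
            bad = 1
        }
        $3 !~ /^[0-9]+(\.[0-9]+)?$/ {
            printf "line %d: weight %s is not a non-negative number\n", NR, $3
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Build a small sample weights file (hypothetical type labels and weights).
weights_file=$(mktemp)
printf '%s\t%s\t%s\n' \
    ABINITIO_PREDICTION Augustus 1 \
    TRANSCRIPT assembler-pasa_db 10 > "$weights_file"

if check_weights "$weights_file"; then
    echo "weights file looks consistent"
fi
rm -f "$weights_file"
```

The check only covers the file layout. The type column must still match the source labels used in your own GFF3 files, which this sketch does not verify.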
gene_prediction_with_braker2_pipeline.1677276156.txt.gz · Last modified: by 134.190.247.138
