gene_prediction_find_supported_orfs
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| gene_prediction_find_supported_orfs [2025/08/18 13:58] – created 134.190.145.228 | gene_prediction_find_supported_orfs [2025/08/18 15:53] (current) – 134.190.145.228 | ||
|---|---|---|---|
| Line 9: | Line 9: | ||
| However, in our experience, the pipeline tools often predict introns in locations where RNAseq data strongly suggests there should not be any introns. I'm not entirely sure why this happens, but I suspect that it is due to how the RNAseq data is used. In these tools, the RNAseq data is used to train or update the Hidden Markov Models (HMMs), and then it is these updated HMMs that are used to predict genes in a manner similar to that of ab initio tools. Hence, after these HMMs are updated, exact intron locations available in the RNAseq data, are ' | However, in our experience, the pipeline tools often predict introns in locations where RNAseq data strongly suggests there should not be any introns. I'm not entirely sure why this happens, but I suspect that it is due to how the RNAseq data is used. In these tools, the RNAseq data is used to train or update the Hidden Markov Models (HMMs), and then it is these updated HMMs that are used to predict genes in a manner similar to that of ab initio tools. Hence, after these HMMs are updated, exact intron locations available in the RNAseq data, are ' | ||
| - | This prompted me to develop a more straightforward python script that directly uses intron locations available in the RNAseq data as it predicts genes. The flipside is that it won't be able to predict genes in areas of the genome that are not expressed, and thus not available in the RNAseq data. | + | This prompted me to develop a more straightforward python script that directly uses intron locations available in the RNAseq data as it predicts genes. The flipside is that it won't be able to predict genes in areas of the genome that are not expressed, and thus not available in the RNAseq data. I callec this script **find_supported_orfs.py**. It's pretty dumb, in the sense that it doesn' |
| + | |||
| + | ===== find_supported_orfs.py in a nutshell ===== | ||
| + | |||
| + | 1. Load BAM file representing the RNAseq data aligned to the genome sequence | ||
| + | |||
| + | 2. Load FASTA file representing the genome sequence | ||
| + | |||
| + | 3. Interrogate the RNAseq BAM file and extract **the locations of all supported introns** | ||
| + | |||
| + | An intron is considered supported when... | ||
| + | |||
| + | * Have at least 5 reads supporting the existence of this intron | ||
| + | * Must be within a certain length range (set by the user) | ||
| + | * Must have specific intron boundaries (e.g. GT-AG, or AT-AG; set by the user) | ||
| + | * At least 40% of the reads mapping here must support this intron | ||
| + | |||
| + | 4. Interrogate the RNAseq BAM file to identify all regions that have high RNAseq coverage | ||
| + | |||
| + | A brief description of this **high coverage region** finding algorithm: | ||
| + | |||
| + | * Iterates over each position in the genome | ||
| + | * Calculate RNAseq coverage for that position (positions supported as introns are also considered ' | ||
| + | * Once a position exceeds a coverage threshold (8 by default), it marks this position as the start of the high coverage region | ||
| + | * Every new position with at least 8 coverage gets added to the high coverage region | ||
| + | * If it encounters a position with less than 8 coverage, it does not immediately mark the end of the high coverage region. | ||
| + | * If the coverage remains low for a set number of positions (i.e. the user-defined ' | ||
| + | * If the coverage returns to high levels after a brief dip, it will keep extending the high coverage region. | ||
| + | * Finally, 25 bps are added to the beginning and the end of the high coverage region. Thus extending the region by 50 bp in total. This " | ||
| + | |||
| + | Currently, a high coverage region must be at least 150 bp in length to move on to the next stage | ||
| + | |||
| + | |||
| + | {{ : | ||
| + | |||
| + | 5. High coverage regions are matched with supported introns and **spliced** accordingly | ||
| + | |||
| + | If any introns in this region are found to be immediately adjacent to each other, or overlapping, | ||
| + | |||
| + | 6. Open reading frames are identified in this high coverage region | ||
| + | |||
| + | Each ORF must be 300 - 10000 bp long. If two ORFs overlap (for example two ORFs covering the same gene but with a different start codon), the longest is preferred. | ||
| + | |||
| + | 7. Introns are then re-inserted and coordinates are adjusted accordingly | ||
| + | |||
| + | 8. The genes are then printed in GFF3 format | ||
| + | |||
| + | ===== Blastocystis ===== | ||
| + | |||
| + | Many // | ||
| + | |||
| + | Run of the mill gene predictors are unaware of this peculiarity, | ||
| + | |||
| + | find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal ' | ||
| + | |||
| + | Currently, only the last 50 bp of a high coverage region are search for this motif, and only the last instance of this motif is edited. This strictness was found to eliminate many false positive hits. | ||
gene_prediction_find_supported_orfs.1755536319.txt.gz · Last modified: by 134.190.145.228
