Differences

This shows you the differences between two versions of the page.

--- gene_prediction_find_supported_orfs [2025/08/18 13:58] – created 134.190.145.228
+++ gene_prediction_find_supported_orfs [2025/08/18 15:53] (current) – 134.190.145.228
@@ Line 9: / Line 9: @@
 However, in our experience, the pipeline tools often predict introns in locations where RNAseq data strongly suggests there should not be any introns. I'm not entirely sure why this happens, but I suspect that it is due to how the RNAseq data is used. In these tools, the RNAseq data is used to train or update the Hidden Markov Models (HMMs), and then it is these updated HMMs that are used to predict genes in a manner similar to that of ab initio tools. Hence, after these HMMs are updated, exact intron locations available in the RNAseq data, are 'forgotten' when the actual gene prediction occurs.
-This prompted me to develop a more straightforward python script that directly uses intron locations available in the RNAseq data as it predicts genes. The flipside is that it won't be able to predict genes in areas of the genome that are not expressed, and thus not available in the RNAseq data.
+This prompted me to develop a more straightforward python script that directly uses intron locations available in the RNAseq data as it predicts genes. The flipside is that it won't be able to predict genes in areas of the genome that are not expressed, and thus not available in the RNAseq data. I callec this script **find_supported_orfs.py**. It's pretty dumb, in the sense that it doesn't make use of fancy Hidden Markov Models or other types of sophisticated statistical tricks.
+===== find_supported_orfs.py in a nutshell =====
+. Load BAM file representing the RNAseq data aligned to the genome sequence
+. Load FASTA file representing the genome sequence
+. Interrogate the RNAseq BAM file and extract **the locations of all supported introns**
+An intron is considered supported when...
+  * Have at least 5 reads supporting the existence of this intron
+  * Must be within a certain length range (set by the user)
+  * Must have specific intron boundaries (e.g. GT-AG, or AT-AG; set by the user)
+  * At least 40% of the reads mapping here must support this intron
+. Interrogate the RNAseq BAM file to identify all regions that have high RNAseq coverage
+A brief description of this **high coverage region** finding algorithm:
+  * Iterates over each position in the genome
+  * Calculate RNAseq coverage for that position (positions supported as introns are also considered 'covered')
+  * Once a position exceeds a coverage threshold (8 by default), it marks this position as the start of the high coverage region
+  * Every new position with at least 8 coverage gets added to the high coverage region
+  * If it encounters a position with less than 8 coverage, it does not immediately mark the end of the high coverage region.
+  * If the coverage remains low for a set number of positions (i.e. the user-defined 'window size'), it will then mark the last known high coverage position as the end of the high coverage region.
+  * If the coverage returns to high levels after a brief dip, it will keep extending the high coverage region.
+  * Finally, 25 bps are added to the beginning and the end of the high coverage region. Thus extending the region by 50 bp in total. This "buffer" makes the gene finding a bit more sensitive.
+Currently, a high coverage region must be at least 150 bp in length to move on to the next stage
+{{ :high_cov_region.jpg?nolink |}}
+. High coverage regions are matched with supported introns and **spliced** accordingly
+If any introns in this region are found to be immediately adjacent to each other, or overlapping, the set of introns is reduced to a non-overlapping one, where the most supported introns are preferred.
+. Open reading frames are identified in this high coverage region
+Each ORF must be 300 - 10000 bp long. If two ORFs overlap (for example two ORFs covering the same gene but with a different start codon), the longest is preferred.
+. Introns are then re-inserted and coordinates are adjusted accordingly
+. The genes are then printed in GFF3 format
+===== Blastocystis =====
+Many //Blastocystis// lineages contain protein coding genes that do not end with a canonical STOP codon. Instead, a gene can end with 'T', 'TA', or 'TG'. Then, when the polyA tail is added during mRNA maturation, the STOP codon is completed.
+Run of the mill gene predictors are unaware of this peculiarity, and will thus miss these genes. As some //Blastocystis// genomes can have up to 30-40% of such genes, this is a pretty major problem.
+find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal 'T', 'TA' or 'TG' is part of a conserved TxxxxTGTTTGTT motif, where 'x' can be any base. The script will thus search for this motif in high coverage regions, and when found, change the sequence to TAAxxTGTTTGTT in-memory before passing it on to the ORF finder. Since now, the sequence does end with 'TAA', the ORF finder module will recognize the gene!
+Currently, only the last 50 bp of a high coverage region are search for this motif, and only the last instance of this motif is edited. This strictness was found to eliminate many false positive hits.