Differences

This shows you the differences between two versions of the page.

--- gene_prediction_find_supported_orfs [2025/08/18 14:00] – 134.190.145.228
+++ gene_prediction_find_supported_orfs [2025/08/18 15:53] (current) – 134.190.145.228
@@ Line 12: / Line 12: @@
 ===== find_supported_orfs.py in a nutshell =====
+. Load BAM file representing the RNAseq data aligned to the genome sequence
+. Load FASTA file representing the genome sequence
+. Interrogate the RNAseq BAM file and extract **the locations of all supported introns**
+An intron is considered supported when...
+  * Have at least 5 reads supporting the existence of this intron
+  * Must be within a certain length range (set by the user)
+  * Must have specific intron boundaries (e.g. GT-AG, or AT-AG; set by the user)
+  * At least 40% of the reads mapping here must support this intron
+. Interrogate the RNAseq BAM file to identify all regions that have high RNAseq coverage
+A brief description of this **high coverage region** finding algorithm:
+  * Iterates over each position in the genome
+  * Calculate RNAseq coverage for that position (positions supported as introns are also considered 'covered')
+  * Once a position exceeds a coverage threshold (8 by default), it marks this position as the start of the high coverage region
+  * Every new position with at least 8 coverage gets added to the high coverage region
+  * If it encounters a position with less than 8 coverage, it does not immediately mark the end of the high coverage region.
+  * If the coverage remains low for a set number of positions (i.e. the user-defined 'window size'), it will then mark the last known high coverage position as the end of the high coverage region.
+  * If the coverage returns to high levels after a brief dip, it will keep extending the high coverage region.
+  * Finally, 25 bps are added to the beginning and the end of the high coverage region. Thus extending the region by 50 bp in total. This "buffer" makes the gene finding a bit more sensitive.
+Currently, a high coverage region must be at least 150 bp in length to move on to the next stage
+{{ :high_cov_region.jpg?nolink |}}
+. High coverage regions are matched with supported introns and **spliced** accordingly
+If any introns in this region are found to be immediately adjacent to each other, or overlapping, the set of introns is reduced to a non-overlapping one, where the most supported introns are preferred.
+. Open reading frames are identified in this high coverage region
+Each ORF must be 300 - 10000 bp long. If two ORFs overlap (for example two ORFs covering the same gene but with a different start codon), the longest is preferred.
+. Introns are then re-inserted and coordinates are adjusted accordingly
+. The genes are then printed in GFF3 format
+===== Blastocystis =====
+Many //Blastocystis// lineages contain protein coding genes that do not end with a canonical STOP codon. Instead, a gene can end with 'T', 'TA', or 'TG'. Then, when the polyA tail is added during mRNA maturation, the STOP codon is completed.
+Run of the mill gene predictors are unaware of this peculiarity, and will thus miss these genes. As some //Blastocystis// genomes can have up to 30-40% of such genes, this is a pretty major problem.
+find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal 'T', 'TA' or 'TG' is part of a conserved TxxxxTGTTTGTT motif, where 'x' can be any base. The script will thus search for this motif in high coverage regions, and when found, change the sequence to TAAxxTGTTTGTT in-memory before passing it on to the ORF finder. Since now, the sequence does end with 'TAA', the ORF finder module will recognize the gene!
+Currently, only the last 50 bp of a high coverage region are search for this motif, and only the last instance of this motif is edited. This strictness was found to eliminate many false positive hits.