User Tools

Site Tools


gene_prediction_find_supported_orfs

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
gene_prediction_find_supported_orfs [2025/08/18 14:00] 134.190.145.228gene_prediction_find_supported_orfs [2025/08/18 15:53] (current) 134.190.145.228
Line 12: Line 12:
  
 ===== find_supported_orfs.py in a nutshell =====  ===== find_supported_orfs.py in a nutshell ===== 
 +
 +1. Load BAM file representing the RNAseq data aligned to the genome sequence
 +
 +2. Load FASTA file representing the genome sequence
 +
 +3. Interrogate the RNAseq BAM file and extract **the locations of all supported introns**
 +
 +An intron is considered supported when...
 +
 +  * Have at least 5 reads supporting the existence of this intron
 +  * Must be within a certain length range (set by the user)
 +  * Must have specific intron boundaries (e.g. GT-AG, or AT-AG; set by the user)
 +  * At least 40% of the reads mapping here must support this intron
 +
 +4. Interrogate the RNAseq BAM file to identify all regions that have high RNAseq coverage
 +
 +A brief description of this **high coverage region** finding algorithm:
 +
 +  * Iterates over each position in the genome
 +  * Calculate RNAseq coverage for that position (positions supported as introns are also considered 'covered')
 +  * Once a position exceeds a coverage threshold (8 by default), it marks this position as the start of the high coverage region
 +  * Every new position with at least 8 coverage gets added to the high coverage region
 +  * If it encounters a position with less than 8 coverage, it does not immediately mark the end of the high coverage region. 
 +  * If the coverage remains low for a set number of positions (i.e. the user-defined 'window size'), it will then mark the last known high coverage position as the end of the high coverage region.
 +  * If the coverage returns to high levels after a brief dip, it will keep extending the high coverage region.
 +  * Finally, 25 bps are added to the beginning and the end of the high coverage region. Thus extending the region by 50 bp in total. This "buffer" makes the gene finding a bit more sensitive.
 +
 +Currently, a high coverage region must be at least 150 bp in length to move on to the next stage
 +
 +
 +{{ :high_cov_region.jpg?nolink |}}
 +
 +5. High coverage regions are matched with supported introns and **spliced** accordingly
 +
 +If any introns in this region are found to be immediately adjacent to each other, or overlapping, the set of introns is reduced to a non-overlapping one, where the most supported introns are preferred.
 +
 +6. Open reading frames are identified in this high coverage region
 +
 +Each ORF must be 300 - 10000 bp long. If two ORFs overlap (for example two ORFs covering the same gene but with a different start codon), the longest is preferred.
 +
 +7. Introns are then re-inserted and coordinates are adjusted accordingly
 +
 +8. The genes are then printed in GFF3 format
 +
 +===== Blastocystis ===== 
 +
 +Many //Blastocystis// lineages contain protein coding genes that do not end with a canonical STOP codon. Instead, a gene can end with 'T', 'TA', or 'TG'. Then, when the polyA tail is added during mRNA maturation, the STOP codon is completed.
 +
 +Run of the mill gene predictors are unaware of this peculiarity, and will thus miss these genes. As some //Blastocystis// genomes can have up to 30-40% of such genes, this is a pretty major problem.
 +
 +find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal 'T', 'TA' or 'TG' is part of a conserved TxxxxTGTTTGTT motif, where 'x' can be any base. The script will thus search for this motif in high coverage regions, and when found, change the sequence to TAAxxTGTTTGTT in-memory before passing it on to the ORF finder. Since now, the sequence does end with 'TAA', the ORF finder module will recognize the gene!
 +
 +Currently, only the last 50 bp of a high coverage region are search for this motif, and only the last instance of this motif is edited. This strictness was found to eliminate many false positive hits.
 +
gene_prediction_find_supported_orfs.1755536459.txt.gz · Last modified: by 134.190.145.228