gene_prediction_find_supported_orfs

Differences

This shows you the differences between two versions of the page.

gene_prediction_find_supported_orfs [2025/08/18 14:59] – [find_supported_orfs.py in a nutshell] 134.190.145.228
gene_prediction_find_supported_orfs [2025/08/18 15:53] (current) 134.190.145.228
Line 17: Line 17:
 2. Load FASTA file representing the genome sequence
  
-3. Interrogate the RNAseq BAM file and extract **the locations of all introns**
+3. Interrogate the RNAseq BAM file and extract **the locations of all supported introns**
  
-To be added, an intron must...
+An intron is considered supported when...
  
   * Have at least 5 reads supporting the existence of this intron
Line 40: Line 40:
  
 Currently, a high coverage region must be at least 150 bp in length to move on to the next stage
 +
 +
 +{{ :high_cov_region.jpg?nolink |}}
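The length filter described above can be sketched as follows. This is only an illustration, not the code in find_supported_orfs.py: the function name, the per-base coverage list and the ''min_coverage'' threshold are assumptions, and only the 150 bp minimum length comes from the text.

```python
from typing import List, Sequence, Tuple

MIN_REGION_LENGTH = 150  # minimum length stated in the text


def high_coverage_regions(
    coverage: Sequence[int],
    min_coverage: int,  # hypothetical threshold; the text does not give a value
    min_length: int = MIN_REGION_LENGTH,
) -> List[Tuple[int, int]]:
    """Return (start, end) half-open intervals of contiguous positions whose
    depth is >= min_coverage and whose length is >= min_length."""
    regions = []
    start = None  # start of the run currently being extended, if any
    for pos, depth in enumerate(coverage):
        if depth >= min_coverage:
            if start is None:
                start = pos
        elif start is not None:
            # The run ended at pos; keep it only if it is long enough.
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions
```

In this sketch a 149 bp run is dropped while a 150 bp run is kept, which mirrors the "at least 150 bp" rule.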
  
 5. High coverage regions are matched with supported introns and **spliced** accordingly
Line 53: Line 56:
 8. The genes are then printed in GFF3 format
  
 +===== Blastocystis ===== 
 +
 +Many //Blastocystis// lineages contain protein-coding genes that do not end with a canonical STOP codon. Instead, a gene can end with 'T', 'TA', or 'TG', and the STOP codon is completed only when the polyA tail is added during mRNA maturation.
 +
 +Run-of-the-mill gene predictors are unaware of this peculiarity and will thus miss these genes. As up to 30-40% of the genes in some //Blastocystis// genomes are of this type, this is a major problem.
 +
 +find_supported_orfs.py integrates a work-around to recognize such genes. The terminal 'T', 'TA' or 'TG' is part of a conserved TxxxxTGTTTGTT motif, where 'x' can be any base. The script therefore searches for this motif in high coverage regions and, when found, changes the sequence to TAAxxTGTTTGTT in memory before passing it on to the ORF finder. Since the sequence now ends with 'TAA', the ORF finder module will recognize the gene!
 +
 +Currently, only the last 50 bp of a high coverage region are searched for this motif, and only the last instance of the motif is edited. This strictness was found to eliminate many false positive hits.
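The motif edit described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in find_supported_orfs.py: the function name and the regular expression are assumptions, as is the choice to require the whole motif to lie within the last 50 bp. The motif, the TAAxx replacement, the 50 bp window and the "last instance only" rule come from the text.

```python
import re

# T, four arbitrary bases, then TGTTTGTT. The lookahead lets overlapping
# occurrences be found, so the "last instance" is chosen reliably.
MOTIF = re.compile(r"(?=(T[ACGTN]{4}TGTTTGTT))", re.IGNORECASE)
TAIL_LENGTH = 50  # only the last 50 bp are searched, per the text


def patch_terminal_stop(region_seq: str, tail_length: int = TAIL_LENGTH) -> str:
    """Return a copy of region_seq in which the last TxxxxTGTTTGTT motif
    found in the final tail_length bases reads TAAxxTGTTTGTT.
    The input is returned unchanged when no motif is found."""
    tail_start = max(0, len(region_seq) - tail_length)
    tail = region_seq[tail_start:]
    matches = list(MOTIF.finditer(tail))
    if not matches:
        return region_seq
    # Position of the motif's leading T in the full sequence.
    pos = tail_start + matches[-1].start()
    # Overwrite the T and the two bases after it with TAA (in-memory copy).
    return region_seq[:pos] + "TAA" + region_seq[pos + 3:]


# Example: a region whose gene ends with 'TG' followed by the motif remainder.
region = "ATG" + "GCC" * 20 + "TGCCATGTTTGTT" + "ACGT"
patched = patch_terminal_stop(region)
```

Because only a copy is edited, the genome sequence itself stays untouched; the patched string is meant to be handed to the ORF finder only.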
  