gene_prediction_find_supported_orfs

gene_prediction_find_supported_orfs [2025/08/18 15:12] – [find_supported_orfs.py in a nutshell] 134.190.145.228
gene_prediction_find_supported_orfs [2025/08/18 15:53] (current) 134.190.145.228
Line 40: Line 40:
  
 Currently, a high coverage region must be at least 150 bp in length to move on to the next stage.
 +
 +
 +{{ :high_cov_region.jpg?nolink |}}
  
 5. High coverage regions are matched with supported introns and **spliced** accordingly
Line 55: Line 58:
 ===== Blastocystis =====
  
-Many Blastocystis lineages contain protein coding genes that do not end with a canonical STOP codon. Instead, a gene can end with 'T', 'TA', or 'TG'. Then, when the polyA tail is added during mRNA maturation, the STOP codon is completed.
+Many //Blastocystis// lineages contain protein-coding genes that do not end with a canonical STOP codon. Instead, a gene can end with 'T', 'TA', or 'TG'. The STOP codon is then completed when the polyA tail is added during mRNA maturation.
 + 
 +Run-of-the-mill gene predictors are unaware of this peculiarity and will thus miss these genes. As such genes can account for up to 30-40% of the genes in some //Blastocystis// genomes, this is a pretty major problem.
 + 
 +find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal 'T', 'TA' or 'TG' is part of a conserved TxxxxTGTTTGTT motif, where 'x' can be any base. The script thus searches for this motif in high coverage regions and, when it is found, changes the sequence to TAAxxTGTTTGTT in memory before passing it on to the ORF finder. Since the sequence now ends with 'TAA', the ORF finder module will recognize the gene!
  
-Run of the mill gene predictors are unaware of this peculiarity, and will thus miss these genes. As some Blastocystis genomes can have up to 30-40% of such genes, this is a pretty major problem.
+Currently, only the last 50 bp of a high coverage region are searched for this motif, and only the last instance of the motif is edited. This strictness was found to eliminate many false positive hits.
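
The motif edit described above can be sketched roughly as follows. This is an illustration only, not the actual code of find_supported_orfs.py. The names ''patch_terminal_motif'', ''MOTIF'' and ''WINDOW'' are hypothetical. The sketch also assumes that the whole motif must lie within the last 50 bp, and that the sequence is handled in upper case.

```python
import re

# Assumption: the motif is 'T', four arbitrary bases, then 'TGTTTGTT'.
# A lookahead is used so that overlapping candidates are not skipped.
MOTIF = re.compile(r"(?=(T[ACGT]{4}TGTTTGTT))")
WINDOW = 50  # only the last 50 bp of a high coverage region are searched


def patch_terminal_motif(region: str) -> str:
    """Rewrite the last TxxxxTGTTTGTT hit in the final 50 bp to TAAxxTGTTTGTT.

    Returns the (upper-cased) region unchanged if no motif is found.
    """
    seq = region.upper()
    offset = max(0, len(seq) - WINDOW)
    # Positions of all motif starts inside the search window,
    # expressed in coordinates of the full region.
    hits = [m.start() + offset for m in MOTIF.finditer(seq[offset:])]
    if not hits:
        return seq
    start = hits[-1]  # only the last instance is edited
    # Overwrite the first three bases of the motif with a canonical 'TAA'.
    return seq[:start] + "TAA" + seq[start + 3:]
```

For example, a region ending in ''...TGCCGTGTTTGTTACGTAC'' would come back as ''...TAACGTGTTTGTTACGTAC'', so a downstream ORF finder sees an ordinary 'TAA' stop. The region length is unchanged, so coordinates stay valid.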
  
-find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal 'T', 'TA' or 'TG' is part of a conserved TxxxxTGTTTGTT motif, where 'x' can be any base. 