gene_prediction_find_supported_orfs
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| gene_prediction_find_supported_orfs [2025/08/18 14:00] – 134.190.145.228 | gene_prediction_find_supported_orfs [2025/08/18 15:53] (current) – 134.190.145.228 | ||
|---|---|---|---|
| Line 12: | Line 12: | ||
| ===== find_supported_orfs.py in a nutshell ===== | ===== find_supported_orfs.py in a nutshell ===== | ||
| + | |||
| + | 1. Load BAM file representing the RNAseq data aligned to the genome sequence | ||
| + | |||
| + | 2. Load FASTA file representing the genome sequence | ||
| + | |||
| + | 3. Interrogate the RNAseq BAM file and extract **the locations of all supported introns** | ||
| + | |||
| + | An intron is considered supported when... | ||
| + | |||
| + | * Have at least 5 reads supporting the existence of this intron | ||
| + | * Must be within a certain length range (set by the user) | ||
| + | * Must have specific intron boundaries (e.g. GT-AG, or AT-AG; set by the user) | ||
| + | * At least 40% of the reads mapping here must support this intron | ||
| + | |||
| + | 4. Interrogate the RNAseq BAM file to identify all regions that have high RNAseq coverage | ||
| + | |||
| + | A brief description of this **high coverage region** finding algorithm: | ||
| + | |||
| + | * Iterates over each position in the genome | ||
| + | * Calculate RNAseq coverage for that position (positions supported as introns are also considered ' | ||
| + | * Once a position exceeds a coverage threshold (8 by default), it marks this position as the start of the high coverage region | ||
| + | * Every new position with at least 8 coverage gets added to the high coverage region | ||
| + | * If it encounters a position with less than 8 coverage, it does not immediately mark the end of the high coverage region. | ||
| + | * If the coverage remains low for a set number of positions (i.e. the user-defined ' | ||
| + | * If the coverage returns to high levels after a brief dip, it will keep extending the high coverage region. | ||
| + | * Finally, 25 bps are added to the beginning and the end of the high coverage region. Thus extending the region by 50 bp in total. This " | ||
| + | |||
| + | Currently, a high coverage region must be at least 150 bp in length to move on to the next stage | ||
| + | |||
| + | |||
| + | {{ : | ||
| + | |||
| + | 5. High coverage regions are matched with supported introns and **spliced** accordingly | ||
| + | |||
| + | If any introns in this region are found to be immediately adjacent to each other, or overlapping, | ||
| + | |||
| + | 6. Open reading frames are identified in this high coverage region | ||
| + | |||
| + | Each ORF must be 300 - 10000 bp long. If two ORFs overlap (for example two ORFs covering the same gene but with a different start codon), the longest is preferred. | ||
| + | |||
| + | 7. Introns are then re-inserted and coordinates are adjusted accordingly | ||
| + | |||
| + | 8. The genes are then printed in GFF3 format | ||
| + | |||
| + | ===== Blastocystis ===== | ||
| + | |||
| + | Many // | ||
| + | |||
| + | Run of the mill gene predictors are unaware of this peculiarity, | ||
| + | |||
| + | find_supported_orfs.py integrates a nifty work-around to recognize such genes. The terminal ' | ||
| + | |||
| + | Currently, only the last 50 bp of a high coverage region are search for this motif, and only the last instance of this motif is edited. This strictness was found to eliminate many false positive hits. | ||
| + | |||
gene_prediction_find_supported_orfs.1755536459.txt.gz · Last modified: by 134.190.145.228
