orf160 in Blastocystis mitochondrial genomes
Knowledge from the Literature: Jacob et al, 2016, GBE
Basic properties
orf160 is an enigmatic gene found in Blastocystis mitochondrial genomes. It is named as such because, even though it is technically not an ORF (see below), it looks like one, and it is between 158 and 162 amino acids long (~480 nt)
In-frame STOP codons
Only in ST4 DMP/02-328 and ST4 DMP/10-212 is this a bona fide ORF. In all other described subtypes, it has one or two in-frame STOP codons. In ST4, position 9 is TGG (Trp) or TAT (Tyr), and position 11 is TTG (Leu). In all other strains, position 9 and/or position 11 is a TAG STOP codon. In ST8 DMP/08-128, position 2 is a TGA STOP codon. Most other STOP codons in the Blastocystis mitochondrial genomes are TAA. No alternative START codons (CTG or TTG) are found after position 9
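The in-frame STOP positions described above can be checked with a short scan. This is a minimal sketch, assuming the standard stop codons (TAA, TAG, TGA) and using only the first 60 nt of the ST4 DMP/10-212 (KU900236) and ST7B (CU914152) copies as listed elsewhere on this page. The function name `in_frame_stops` is made up for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code stop codons

def in_frame_stops(nt, stops=STOPS):
    """Return (codon_number, codon) for every stop codon in frame 1 (1-based)."""
    nt = nt.upper().replace(" ", "")
    hits = []
    # Ignore a trailing partial codon, if any
    for i in range(0, len(nt) - len(nt) % 3, 3):
        codon = nt[i:i + 3]
        if codon in stops:
            hits.append((i // 3 + 1, codon))
    return hits

# First 60 nt (20 codons) of orf160, copied from the sequences on this page
st4_prefix = "ATGTTACAACTGTTATTGGTAGTTTATATATTGTTTTTGGTGATATTGATCGATAATATT"
st7b_prefix = "ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA"

print("ST4 :", in_frame_stops(st4_prefix))
print("ST7B:", in_frame_stops(st7b_prefix))
```

Running the same function on a full-length copy would also report the terminal stop, so the last hit has to be ignored when counting in-frame STOPs.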
No clear homologs
It has extremely low sequence similarity to anything in the public sequence databases, and no BLAST hits outside of Blastocystis were identified. Even within Blastocystis, BLAST hits only have ~27.5% sequence similarity at the amino acid level. The %GC and amino acid composition are somewhat reminiscent of some ribosomal proteins
It is not a pseudogene
It is unlikely to be a pseudogene:
- If the in-frame STOP codon was due to pseudogenization, it was a pseudogene already in the Blastocystis ancestor. Given the age of the pseudogene, we would expect to see more in-frame STOP codons
- There are no other in-frame STOP codons
- The ORF is still pretty long, 480 nucleotides
- dN/dS ratio < 1, indicating negative selection
orf160 (negative strand) overlaps with its upstream neighbor, nad7 (negative strand)
Jacob et al identified a 55 or 56 bp overlap at the N-terminal end of orf160. This means that the first 19 codons (amino acids) of orf160 overlap with the C-terminus of nad7. We also checked the overlap with nad7 in the Blastocystis NandII strain
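The step from "55 or 56 bp" to "19 codons" is just a ceiling division, since a codon counts as overlapping even if only one of its nucleotides lies in the overlap. A tiny sketch (the helper name `codons_touched` is hypothetical):

```python
import math

def codons_touched(overlap_bp):
    """Number of codons at least partly inside an overlap that starts at codon 1."""
    return math.ceil(overlap_bp / 3)

for bp in (55, 56):
    print(bp, "bp overlap ->", codons_touched(bp), "codons")
```

Either way the overlap reaches codon 19, so codon positions 9 and 11 (where the in-frame TAG codons occur) both lie inside the region shared with nad7.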
No evidence as of yet for transcription of this gene
Jacob et al tried RT-PCR, and the Roger lab (with Eleni Gentekaki) also found no EST or RNAseq evidence
Hypothesis 1: TAG has been re-assigned to a sense codon
If this is true, all mitochondrial genes should end with TAA or TGA, which according to Jacob et al is not always the case. If this is true, we should also be able to find a tRNA with an anticodon able to recognize TAG. Jacob et al were unable to find any in the ST1 nuclear genome
Hypothesis 2: TAG in position 9 is RNA edited to TAA
Unable to check this hypothesis because no RNA data available
Hypothesis 3: Translational read-through circumvents translation of TAG
Hypothesis 4: Alternative start codon? Other than CTG or TTG
Joran's work in June 2025
orf160 encodes for mitochondrial ribosomal subunit RPL10
I took the protein sequence of one of the only two proper orf160 copies, from ST4 DMP/10-212 (APC25055.1), and threw it into the online AlphaFold. It gave me a decent structure, which I threw into Foldseek. Among the AFDB-SwissProt hits, the top ones were to 39S ribosomal protein L10, mitochondrial, among eukaryotes, and to 50S ribosomal protein L10 among bacteria. Apparently, RPL10 is either absent or unrecognized in many eukaryotic lineages (Ryo Harada et al 2025, Journal of Eukaryotic Microbiology). For all of those hits, the first 21-25 or so amino acids of the N-terminal region did not align in the Foldseek alignment. The unaligned N-terminal part of the query also seems to stick out of the structure a bit.
ST4 DMP/10-212 orf160 does not seem to have any targeting signals
I tried Deeploc and SignalP
orf160 in Blastocystis genomes ST7c,e,g,h,b also have the in-frame STOP codons
As a query, I used the publicly available orf160 copy of ST7B from CU914152.1, the complete nucleotide sequence of the ST7B mitochondrial genome. The relevant orf160 sequence (coordinates 13813-14292), with the in-frame STOP codon annotated:
>CU914152_13813-14292 CU914152
  .  .  .  .  .  .  .  .TAG.
ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA
AAAATTGTACGTAAATATAATTATATTTTAAAATTTAATCAATATTTTAAAAATTATAAA
TATATGTTATTTTGTGATAATACTAGTTTAAATTTGAATTTATATAAACATGAAATTTTG
TTAAATCCTAATGTTAAATGTATTTTCTTAAAAAAATTTAAATGTATTGATAATTTAACA
TATTTTAATTCTAATTTAAAAAATTCAACAGTAATTTTTTGTACTAATGATTTACAAACT
TTATATTTAATAATTTCTAAATTACAAACTAATATATTATTTTGTAAAATACAAAATAAT
TATTATTCTTTAAAAAATTTAAATACTTATATTAATTCTATATATGGATTAGTTAATTAT
TTAGATAATTATATGAGTAATTTTATATTTTTATTTCAACAAATTTCTAAAAAACAATAA
I found that strains B, E (Seq_114_MRO), and G had perfect matches. C, E (Seq_115_MRO) and H had some frameshifts, which introduced many other STOP codons. The frameshifts always occurred in this area TGTATTTTCTTAAAAAAATTT, which is a bit downstream of the in-frame STOP codon
Unsurprisingly, the errors were in the long homopolymer A region. Copies with 7 A's had no frameshifts. The C and H frameshifts could be explained by persistent, unpolished sequencing errors:
Canu sometimes struggles with circular genomes, and the MRO contigs were 1.5x - 2x too big. The ‘superfluous’ regions had, for some reason, poor Illumina sequencing coverage and were therefore poorly polished. Unaware of this (totally understandable), Greg had cut the ‘good’ parts and kept the ‘bad’ parts. orf160 was found in both the ‘good’ and the ‘bad’ parts, and Greg had thus kept the ‘bad’ orf160
After correcting for this by recutting the MRO contigs, the orf160 copies had perfect matches to the CU914152 copy. The E (Seq_114_MRO) frameshift is probably also a persistent, unpolished sequencing error. However, the superfluous area of Seq_114_MRO did not have an extra orf160 copy, so no ‘good’ copy existed. Yet the Illumina coverage of orf160 shows that the homopolymer area in question was not well covered. Hence, it is likely a remaining sequencing error
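The homopolymer check described above (7 A's means no frameshift) could be scripted roughly as follows. This is only a sketch: the region is the reference stretch quoted above from CU914152, while the function names and the minimum run length of 4 are assumptions made for illustration.

```python
import re

def homopolymer_runs(seq, base="A", min_len=4):
    """Return (start, length) for every run of `base` at least `min_len` long."""
    pattern = re.compile(f"{base}{{{min_len},}}")
    return [(m.start(), len(m.group())) for m in pattern.finditer(seq.upper())]

def keeps_frame(observed_len, expected_len=7):
    """A length change preserves the reading frame only if it is a multiple of 3."""
    return (observed_len - expected_len) % 3 == 0

region = "TGTATTTTCTTAAAAAAATTT"  # reference region, as quoted above
for start, length in homopolymer_runs(region):
    print(f"A-run at {start}, length {length}, frame kept: {keeps_frame(length)}")
```

A copy with 6 or 8 A's in this run would be flagged as frame-breaking, which is the pattern expected from unpolished long-read homopolymer errors.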
The TAG codon is probably not reassigned to a sense codon
No tRNA genes with anticodon able to recognize TAG
The perfect anticodon to TAG would be CTA, or more specifically CUA in RNA terms:
3'-AUC-5'  anticodon in tRNA
   |||
5'-UAG-3'  codon in mRNA
However, a TTA anticodon (UUA in RNA terms) might also be possible, with a G-U wobble pair at the third codon position:
3'-AUU-5'  anticodon in tRNA
   |||
5'-UAG-3'  codon in mRNA
But a UUA anticodon would fit much better with TAA, a STOP codon, so it is unlikely that there is actually a UUA/TTA tRNA:
3'-AUU-5'  anticodon in tRNA
   |||
5'-UAA-3'  codon in mRNA
I ran tRNAscan-SE on the nuclear and mitochondrial genomes of ST7C (assembled by Greg and me), but could not find any CUA/CTA or UUA/TTA tRNAs. prokka, which runs aragorn under the hood, also did not find any tRNAs with these anticodons in the ST7C mitochondrial genome. Aragorn is a lot more sensitive (and picks up many false positives as well), so if even aragorn cannot find such tRNAs, they probably do not exist.
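The anticodon reasoning in this section can be written down as a small check. This is a simplified sketch, not a model of real decoding: it assumes plain Watson-Crick pairing plus G-U wobble only at the third codon position, ignores base modifications, and uses DNA letters throughout. All names are made up for illustration.

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
GU_WOBBLE = {("G", "T"), ("T", "G")}  # G-U pairs, written with DNA letters
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def perfect_anticodon(codon):
    """Reverse complement of a codon: the fully Watson-Crick anticodon (5'->3')."""
    return "".join(COMPLEMENT[b] for b in reversed(codon.upper()))

def can_pair(anticodon, codon, allow_wobble=True):
    """Antiparallel pairing check; G-U wobble is allowed only at codon position 3."""
    aligned = anticodon.upper()[::-1]  # anticodon read 3'->5', aligned to codon 5'->3'
    for pos, pair in enumerate(zip(aligned, codon.upper())):
        if pair in WATSON_CRICK:
            continue
        if allow_wobble and pos == 2 and pair in GU_WOBBLE:
            continue
        return False
    return True

print(perfect_anticodon("TAG"))   # anticodon that matches TAG without wobble
print(can_pair("TTA", "TAG"))     # TTA anticodon on TAG, needs G-U wobble
print(can_pair("TTA", "TAA"))     # TTA anticodon on TAA, no wobble needed
```

The same helper can be used to list which anticodons to search for in tRNAscan-SE or aragorn output before scanning a genome.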
TAG is used as a STOP in at least two other genes
I ran prokka with `--addgenes --addmrna --kingdom Mitochondria --cdsrnaolap --gcode 1` on Greg’s ST7C mitochondrial genome. `--kingdom Mitochondria` ensures that it searches mitochondrial databases for functional annotation, and `--gcode 1` ensures that it uses the standard genetic code. prokka runs prodigal under the hood, which uses Genetic Code 11 (Bacteria, Archaea, Plastids) by default
It found one gene that ends with TGA, ST7C_00011, and three genes that end with TAG: ST7C_00027 (hypothetical), ST7C_00039 (nad9), and ST7C_00057 (nad4).
To assess whether these genes were predicted with the correct end codons, I compared with existing annotations of homologs in ST4 DMP/10-212.
ST7C_00011 may be a false positive gene. It is short (43 amino acids) and has only bacterial, uncharacterized BLAST hits in public databases (there was 1 Blasto hit, but that may also be a mispredicted gene). It is also entirely enveloped (on the opposite strand) by ST7C_00010, most likely the 16S rRNA gene.
ST7C_00027 is a homolog of APC25073.1, ribosomal protein S12. ST7C_00027 is length 146, APC25073.1 is length 125. They have the same start, but ST7C_00027 is about 21 amino acids longer. The APC25073.1 stop codon is TAA (see below). Their nucleotide sequences are very similar around the TAA codon in ST4. ST7C seemingly has an insertion relative to ST4, which shifts the TAA out of frame, so translation continues until the downstream TAG. This could explain why the ST7C copy ends with TAG instead of TAA. The insertion happens in an AAAAAAA stretch, suggesting it could be a homopolymer-type sequencing error. However, upon checking the Illumina DNA read mapping, there do not seem to be any sequencing errors in this area in ST7C. It is also curious that this is the area where a tRNA is predicted to start on the same strand. You would perhaps not expect this overlap.
tRNA ST7C                         TCT CTA ATA GTT CAA GGG TTA .... TAGAGAG  # tRNA Asn (gtt)
           G   T   K   R   K   K   S   L   I   V   Q    G    L   E   H   M   T   V   N   H   V   V   I   G   S   S   P   I   *
ST7C      GGT ACA AAA AGA AAA AAA TCT CTA ATA GTT CAA -GG G TTA GAA CAC ATG ACT GTT AAT CAT GTT GTT ATA GGT TCG AGT CCT ATT TAG
          ||  |   ||| | | ||| |   |||  || ||| ||| |||  || i
ST4       GGA ATG AAA AAA AAA AG- TCT TTA ATA GTT CAA AGG T
           G   M   K   K   K   S    L   *   *   F   K   G
                                       TA A
tRNA ST4                          TCT TTA ATA GTT CAA AGG TTA .... TAGAGAG  # tRNA Asn (gtt)
ST7C_00039 is a homolog of `APC25066.1`, NADH dehydrogenase subunit 9 (nad9). `ST7C_00039` is length 196, and `APC25066.1` is also length 196. They have the same start and end. The `APC25066.1` stop codon is `TAA` (see below). There are no insertions or deletions, and no sequencing errors in ST7C at the STOP codon. It looks like a simple mutation causes the difference between the two strains, which suggests the ST7C `TAG` is a true stop codon.
       S   P   K   F   K   S   Y   Y   N   Y   D   N   F   Y   S   F   *
ST7C  TCT CCT AAA TTT AAA TCT TAT TAT AAT TAT GAT AAT TTT TAT TCA TTT TAG
      ||. ||. ||| ||| ||| ||| ||| ||| |   | |  || ||| ||| |||  |  ||  ||
ST4   TCA CCA AAA TTT AAA TCT TAT TAT ATG TTT AAT AAT TTT TAT ACT TTA TAA
       S   P   K   F   K   S   Y   Y   M   F   N   N   F   Y   T   L   *
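The nad9 comparison above (same length, no indels, different terminal codon) can be reproduced from the two aligned 3' ends. A minimal sketch, with the codons copied from the alignment above and a hypothetical helper name:

```python
# Last 17 codons of nad9, copied from the ST7C / ST4 alignment above
st7c = "TCT CCT AAA TTT AAA TCT TAT TAT AAT TAT GAT AAT TTT TAT TCA TTT TAG".split()
st4 = "TCA CCA AAA TTT AAA TCT TAT TAT ATG TTT AAT AAT TTT TAT ACT TTA TAA".split()

def compare_tails(a, b):
    """Return (identical nt, total nt, terminal codon of a, terminal codon of b)."""
    assert len(a) == len(b), "tails must be aligned without gaps"
    seq_a, seq_b = "".join(a), "".join(b)
    identical = sum(x == y for x, y in zip(seq_a, seq_b))
    return identical, len(seq_a), a[-1], b[-1]

identical, total, stop_st7c, stop_st4 = compare_tails(st7c, st4)
print(f"{identical}/{total} nt identical; stops: ST7C {stop_st7c}, ST4 {stop_st4}")
```

Because the two tails have the same number of codons and end at the same column, the TAG versus TAA difference is a single substitution at the third position of the stop codon.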
ST7C_00057 is a homolog of APC25056.1 (nad4). ST7C_00057 is length 73, whereas APC25056.1 is length 487! `ST7C_00057` matches only the C-terminal part of `APC25056.1`, possibly because `ST7C_00057` is the last gene on the linear representation of the circular MRO genome. Prokka calls prodigal in `meta` mode and with `-c`, which does not allow genes to run off edges. This may mean (I am unsure what exactly is meant by running off edges) that genes cannot wrap around from the end to the start of the FASTA record. The `APC25056.1` stop codon is `TAA` (see below). `ST7C_00057` and `APC25056.1` end at the same position, suggesting the `TAG` of ST7C is at least in the right spot.
       F   I   I   G   I   Y   P   T   F   I   L   D   Y   L   N   M   S   V   S   F   L   L   N   I   V   S   C   *
ST7C  TTT ATT ATA GGT ATT TAT CCT ACT TTC ATT TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT ATA GTA TCT TGT TAG
      |||  |  ||  ||   |  ||| ||| ||| ||   |  ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| |||  ||  |  | | ||| ||
ST4   TTT TTA ATT GGA TTA TAT CCT ACT TTT TTA TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT TTA ATT TGT TGT TAA
       F   L   I   G   L   Y   P   T   F   L   L   D   Y   L   N   M   S   V   S   F   L   L   N   L   I   C   C   *
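If ST7C_00057 is truncated because nad4 spans the point where the circular genome was linearised, one way to test this would be to rotate the sequence so that the breakpoint falls elsewhere, and then re-run the annotation. A minimal sketch of the rotation step only; the function name and the toy sequence are hypothetical, and the choice of a suitable new start (for example inside an intergenic region) is left open:

```python
def rotate_circular(seq, new_start):
    """Rotate a circular sequence so that 0-based index `new_start` becomes position 0."""
    new_start %= len(seq)
    return seq[new_start:] + seq[:new_start]

# Toy example: a 'gene' TTTAAA that is split across the ends of the linear record
linear = "AAACCCGGGTTT"
rotated = rotate_circular(linear, 9)
print(rotated)
```

After rotation the toy gene is contiguous at the start of the record, so a gene caller that does not wrap around sequence ends could find it in one piece.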
ST7C nuclear genome encodes a mitochondrial release factor likely to bind UAG
There are broadly speaking two mitochondrial release factors:
- mtRF1, which binds UAA and UAG
- mtRF2, which binds UAA and UGA
Using human, yeast and annotated Blasto copies, and BLASTP vs all st7 strains, I’ve identified the ST7 Blasto homologs:
- mtRF1: ST7C_HKIIKG_7787_gene (Seq23) (reciprocal BLAST against UniProt returns peptide release factor 1, prfA)
- mtRF2: ST7C_TYYQWW_4460_gene (Seq9) (reciprocal BLAST against UniProt returns peptide release factor 2, prfB)
- Both are encoded in the nuclear genome. Targeting peptides?
ST7C_HKIIKG_7787_gene has the glutamine residue Gln181: GVHRV--Q--RVPETESQGRIHTSTMTVAVL (Q = glutamine). Gln181 is critical for recognizing the G in TAG, and its presence suggests that TAG still functions as a STOP codon (Zihala & Elias, 2019). The protein also does not have a serine at position 206, which in RF2 is associated with being able to recognize both A and G in the second codon position (allowing recognition of TAA and TGA). It also has a T (Thr186), which is responsible for discriminating adenine from guanine at the second codon position. This means it should not recognize TGA.
There are hints that orf160 actually starts downstream of the in-frame STOP codons
- All Blasto mitochondrial genomes have a 56 or 55 bp overlap between the end of nad7 and the ‘M’ start of orf160, which is really an unusually large overlap. This suggests that perhaps the true start of orf160 is a bit after the ‘M’ start? The first 55/56 bp of the supposed orf160 gene are much more conserved than the remainder of the gene, (see below) which is possibly explained by the fact that this part is for nad7 and not for rpl10 or orf160
- The N-terminus of the AlphaFold structure forms a helix of about 25 aa that ‘sticks’ out from the rest of the structure. Perhaps this sticking out is unnatural?
- In the AlphaFold + Foldseek hits, the first 21-28 or so amino acids of the ST4 query invariably do not align to the target hits, suggesting that perhaps this N-terminal part of the query is not homologous to RPL10?
- AlphaFold3 + Foldseek with alternative V-14 start also recovers mitochondrial RPL10 hits, suggesting this could still function possibly with the V as the start
Some STs also have an RPS4 gene that appears to lack an ATG start codon.
FoldMason alignment of orf160 suggests ATG is not used as a START codon
orf160 sequences within Blastocystis are too divergent to get sensible alignments with mafft. Since Foldseek gave us confident hits indicating that it encodes RPL10, we can infer that the structure is fairly well conserved, even if the primary sequence is not. Hence, ideally, we would like to use structural information to align all orf160 homologs. We can do this with FoldMason, another tool from the same group that developed Foldseek.
I ran AlphaFold3 on all Jacob et al orf160 predicted amino acid sequences (replacing ‘*’ with ‘X’) of many different subtypes, and used FoldMason MSA to align these predicted structures with predicted structures of the best Foldseek hits (using ST4 homolog as query).
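The ‘*’ to ‘X’ replacement mentioned above can be sketched as follows. This assumes the standard genetic code and, as a further assumption not stated above, that a terminal ‘*’ is dropped rather than converted. The input is the first 60 nt of the ST7B copy (CU914152) listed on this page; function names are made up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(nt):
    """Translate frame 1, keeping '*' for stop codons and ignoring a partial last codon."""
    nt = nt.upper()
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt) - len(nt) % 3, 3))

def mask_stops(protein):
    """Drop a terminal '*' (if present) and replace internal '*' with 'X'."""
    return protein.rstrip("*").replace("*", "X")

st7b_prefix = "ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA"
raw = translate(st7b_prefix)
print(raw)
print(mask_stops(raw))
```

Using ‘X’ keeps the residue numbering identical to the nucleotide-derived codon positions, which makes it easier to map structural alignment columns back to codons later.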
The first 20-30 or so amino acids of the Blasto homologs do indeed not seem to align to the core RPL10 domain. Further suggesting that perhaps this part is not actually translated by Blasto
After those 20-30 amino acids, there was not a single methionine strictly conserved across all Blasto subtypes. So, if it indeed starts after the TAG STOP codon, and it is indeed not a pseudogene, then it must use an alternative START codon.
If so, then what is that START codon?
The search for alternative initiator tRNAs
According to Jacob et al, 2016 and Perez-Brocal & Clark, 2008, the MRO genome encodes three methionine tRNA genes: two elongator tRNAs (Me1 and Me2), which are close to each other on the mitochondrial genome, and one initiator tRNA (Mf), located in between the small (12S) and large (16S) subunit rRNA genes.
All of them have the CAU anticodon, which would match with the ATG start codon. So, perhaps there is another initiator tRNA gene, either on the mitochondrial or on the nuclear genome, that could fit with an alternative start in the orf160 gene.
Two Ile (aat anticodon) tRNA genes were possibly missed by previous annotations
Visual inspection of Blasto MRO genome gene structures (as annotated in the literature), revealed that some areas were unannotated, or unaccounted for. In particular, within a block of tRNA genes there seems to be a gap large enough to fit exactly one additional tRNA gene. Also between nad6 and 12S rRNA and between nad2 and nad11 there is such a gap. Perhaps these areas could encode for a tRNA that is an alternative initiator.
The literature annotations for all subtypes are derived from Perez-Brocal and Clark, 2008, who sequenced MRO genomes of Blastocystis DMP_02_328 (ST4) and NandII (ST1). tRNA genes for those two strains were predicted with tRNAscan-SE.
tRNAscan-SE is very specific, but can lack sensitivity in mitochondrial genomes, as its covariance models are built from bacterial, archaeal, eukaryotic nuclear, or metazoan/mammalian mitochondrial tRNAs. Aragorn, on the other hand, is highly sensitive but not very specific. It also seems to be better at detecting mitochondrial-type tRNAs
I applied aragorn in mitochondrial mode to all Jacob et al blasto MRO genomes:
aragorn -mt -mtd -c -d -e -rp -br -wa -o $base.aragorn.out $fasta
aragorn -mt -mtd -c -d -e -rp -o $base.aragorn.full.out $fasta
I compared with pre-existing annotations. It found loads of false positives, but there were a few interesting hits:
1. A tRNA gene in all 3 ST3 strains between nad6 and 12S rRNA gene, with anticodon `aat` (codon `ATT`):
t
t-a
t-a
t.t
a-t
g+t
t.t
t-a
t-a a
t tatat a
ataa a !!!!! a
t aaat atata t
a !!!! t tt
t ttta t
agat a a
a-tga
g-c
a-t
g+t
a-t
t t
t g
aat
mtRNA-Ile(aat)
77 bases, %GC = 9.1
Sequence [1139,1215]
Score = 102.678
2. A tRNA in ST6, with aat as anticodon, in between Asn and Leu tRNAs in a tRNA block:
t
t-a
a-t
a-t
a-t
a-t
t.t
at.t gaa
t tgattt a
a a :!!!!! t
t aac tctaaa t
a !!! a ttat
ttg a
a t g
a-tga
a-t
a-t
t.t
a a
t a
c g
aat
Possible Pseudogene
mtRNA-Ile(aat)
73 bases, %GC = 12.3
Sequence [28432,28504]
Score = 94.5612
So both possibly new tRNAs have the anticodon aat, which recognizes the codon ATT. The original annotation already had an Ile tRNA with anticodon gat, for the codon ATC. So there is perhaps some redundancy here, and perhaps these new hits are pseudogenes? According to theory, aat could only base pair with the codon ATT
Blastocystis imports many tRNAs from the cytosol
The Jacob et al Blasto MRO genomes seem to encode tRNAs for only 14 amino acids, and lack tRNAs for Ser, Val, Thr, Gln, Arg and Gly. Perez-Brocal & Clark, 2008 had already observed this paucity of tRNA genes and called it “the most dramatic case of tRNA gene loss observed within the stramenopiles”. The mtDNA-encoded proteins do include these amino acids, so their tRNAs must be encoded on the nuclear genome and imported into the mitochondrion. Maybe the alternative initiator tRNA is also imported?
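That mtDNA-encoded proteins use the six amino acids without a mitochondrial tRNA gene can be illustrated with a single protein. The sketch below only counts residues in the ST4 DMP/10-212 orf160 translation from the ‘Useful data’ section of this page; it says nothing about the other mitochondrial genes.

```python
from collections import Counter

# Amino acids for which no tRNA gene was found in the MRO genomes (see above)
MISSING_TRNAS = {"S": "Ser", "V": "Val", "T": "Thr", "Q": "Gln", "R": "Arg", "G": "Gly"}

# ST4 DMP/10-212 orf160 protein (KU900236), terminal '*' removed
protein = (
    "MLQLLLVVYILFLVILIDNIKLKYLRRSFNFKKISMFEQYHYIVICSNLQIVSNLKQLLL"
    "QYPTIKIQFFKKSNRNIYLIFLLPYLTNSLILLGCNELNVFFKLCECVSKNILFIKVQNT"
    "IYSINQFMDCSSNQIMFGQTLNSLYYNLIKVFYSFSLLHK"
)

counts = Counter(protein)
for one_letter, three_letter in MISSING_TRNAS.items():
    print(three_letter, counts[one_letter])
print("amino acids with a mt-encoded tRNA:", 20 - len(MISSING_TRNAS))
```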
The mitochondrial gene for RPS4 also lacks a START codon in several STs
RPS4 also lacks a typical start codon in ST1, ST2, ST4 and ST8 (Jacob et al 2016). If they use an alternative START, it may be the same START that orf160 is using. RPS4 in Blastocystis seems curiously also twice the size of that in Proteromonas lacertae.
I collected RPS4 homologs from the Jacob et al MRO genomes and Proteromonas lacertae, did an AF3 + FoldMason alignment, and then back-translated it into a codon alignment.
RPS4 of Blasto is indeed about 2x the size of that of Proteromonas. The Proteromonas copy starts with ATG and aligns with the N-terminus of the Blastocystis copies.
I compared the RPS4 codon alignment with that of the RPL10 / orf160 alignment.
Also here there is no codon site that is exclusively ATG among all lineages, suggesting this gene is using an alternative start.
- Codons that are close to perfect conservation in the beginning area of RPL10 and RPS4: AAA, AAT, …
- I already had the RPL10 FoldMason alignment (see above).
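Looking for codon sites that are exclusively ATG, or close to perfectly conserved, amounts to a per-column tally over the codon alignment. A minimal sketch, assuming the alignment is available as space-separated codons per taxon; the three-taxon alignment below is made-up toy data, not the real RPS4 or RPL10 alignment.

```python
from collections import Counter

def column_conservation(codon_alignment):
    """Per codon column: (1-based index, most common codon, fraction of rows carrying it)."""
    rows = [row.split() for row in codon_alignment.values()]
    n_cols = len(rows[0])
    assert all(len(r) == n_cols for r in rows), "rows must have equal length"
    result = []
    for col in range(n_cols):
        codon, n = Counter(r[col] for r in rows).most_common(1)[0]
        result.append((col + 1, codon, n / len(rows)))
    return result

# Hypothetical toy alignment ('---' marks a gap codon)
toy = {
    "taxon_A": "ATG AAA AAT TTA",
    "taxon_B": "ATA AAA AAT TTG",
    "taxon_C": "--- AAA AAT TTA",
}

table = column_conservation(toy)
conserved_atg = [i for i, codon, frac in table if codon == "ATG" and frac == 1.0]
print(table)
print("columns that are exclusively ATG:", conserved_atg)
```

Sorting the table by the fraction column would give a ranked list of candidate sites to compare between the RPS4 and RPL10 alignments.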
Ideas to explore
- Check RNA expression levels
- What makes a start codon a start codon?
- Check for Shine-Dalgarno sequences, possibly after the in-frame start codon?
- Andrew: I guess to know if this is ‘significant’ you’d have to look at the density of these kinds of codons throughout the whole sequence. Since mtDNAs like this are A+T-rich, codons that are ‘close’ to ‘ATG’ might not be so rare.
- Andrew: One way to test this is to look at the distances from the in-frame stop codon to all ‘near start codons’ in the sequence and add them all up. Then randomly choose the same number of codon positions in that same interval (not allowing the same position to be chosen twice) and calculate the same summed distance. Do the latter step 100 times, and that gives you a distribution of what uniformly placed codons would look like. If the ‘true’ summed distance is smaller than the random distribution, it would suggest that ‘near start codons’ are clustered towards the beginning.
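Andrew's randomisation test could be sketched along these lines. Positions are expressed as codon offsets downstream of the in-frame stop codon. The example positions, the interval length, and the function name are hypothetical placeholders, and the definition of a ‘near start codon’ is left to the caller.

```python
import random

def clustering_test(near_start_positions, interval_len, n_perm=100, seed=0):
    """
    near_start_positions: 1-based codon offsets downstream of the in-frame stop.
    interval_len: number of codon positions in the interval being considered.
    Returns (observed summed distance, list of random summed distances,
             fraction of random sums that are <= the observed sum).
    """
    rng = random.Random(seed)
    k = len(near_start_positions)
    observed = sum(near_start_positions)
    # Sampling without replacement: the same position cannot be chosen twice
    random_sums = [
        sum(rng.sample(range(1, interval_len + 1), k)) for _ in range(n_perm)
    ]
    fraction = sum(s <= observed for s in random_sums) / n_perm
    return observed, random_sums, fraction

# Hypothetical example: 4 'near start codons' within a 150-codon interval
observed, random_sums, fraction = clustering_test([3, 5, 12, 20], 150)
print(observed, min(random_sums), max(random_sums), fraction)
```

A small fraction means few random draws are as close to the stop codon as the observed codons, which would be consistent with clustering towards the beginning. With only 100 permutations the resolution is coarse, so more permutations may be worth running.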
Useful data
ST4 DMP/10-212 orf160 sequence (nucleotide, amino acid, codons)
>KU900236_9694-10176 KU900236
ATGTTACAACTGTTATTGGTAGTTTATATATTGTTTTTGGTGATATTGATCGATAATATT
AAATTAAAATATTTACGTCGTAGTTTTAATTTTAAAAAAATTTCTATGTTTGAGCAATAT
CATTATATTGTAATATGTTCAAACTTACAAATAGTTTCAAATTTAAAACAATTATTACTT
CAATATCCTACAATAAAAATTCAATTTTTTAAAAAATCAAACAGAAATATCTATTTAATT
TTTTTATTACCATATTTAACTAATTCATTAATTTTATTGGGATGTAACGAACTTAATGTT
TTTTTTAAATTGTGTGAATGTGTTTCTAAAAATATTTTGTTTATAAAAGTACAAAATACG
ATTTATTCTATAAATCAATTTATGGATTGTTCTTCAAATCAAATTATGTTTGGACAAACT
TTAAATTCACTTTATTATAATTTGATTAAAGTTTTTTATTCTTTTTCTTTATTACATAAA
TAA
>KU900236_9694-10176 KU900236
MLQLLLVVYILFLVILIDNIKLKYLRRSFNFKKISMFEQYHYIVICSNLQIVSNLKQLLL
QYPTIKIQFFKKSNRNIYLIFLLPYLTNSLILLGCNELNVFFKLCECVSKNILFIKVQNT
IYSINQFMDCSSNQIMFGQTLNSLYYNLIKVFYSFSLLHK*
>KU900236_9694-10176
ATG TTA CAA CTG TTA TTG GTA GTT TAT ATA TTG TTT TTG GTG ATA TTG ATC GAT AAT ATT
M L Q L L L V V Y I L F L V I L I D N I
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
* $$$ &&& &&& ^^^ $$$ &&& %%% ###
AAA TTA AAA TAT TTA CGT CGT AGT TTT AAT TTT AAA AAA ATT TCT ATG TTT GAG CAA TAT
K L K Y L R R S F N F K K I S M F E Q Y
^^^
CAT TAT ATT GTA ATA TGT TCA AAC TTA CAA ATA GTT TCA AAT TTA AAA CAA TTA TTA CTT
H Y I V I C S N L Q I V S N L K Q L L L
CAA TAT CCT ACA ATA AAA ATT CAA TTT TTT AAA AAA TCA AAC AGA AAT ATC TAT TTA ATT
Q Y P T I K I Q F F K K S N R N I Y L I
TTT TTA TTA CCA TAT TTA ACT AAT TCA TTA ATT TTA TTG GGA TGT AAC GAA CTT AAT GTT
F L L P Y L T N S L I L L G C N E L N V
TTT TTT AAA TTG TGT GAA TGT GTT TCT AAA AAT ATT TTG TTT ATA AAA GTA CAA AAT ACG
F F K L C E C V S K N I L F I K V Q N T
ATT TAT TCT ATA AAT CAA TTT ATG GAT TGT TCT TCA AAT CAA ATT ATG TTT GGA CAA ACT
I Y S I N Q F M D C S S N Q I M F G Q T
TTA AAT TCA CTT TAT TAT AAT TTG ATT AAA GTT TTT TAT TCT TTT TCT TTA TTA CAT AAA
L N S L Y Y N L I K V F Y S F S L L H K
TAA
*
