orf160 in Blastocystis mitochondrial genomes

Knowledge from the Literature: Jacob et al, 2016, GBE

Basic properties

orf160 is an enigmatic gene found in Blastocystis mitochondrial genomes. It is named as such because, even though it's not technically an ORF (see below), it looks like one, and it's between 158 and 162 amino acids long (about 480 nt).

In-frame STOP codons

Only in ST4 DMP/02-328 and ST4 DMP/10-212 is this a bona fide ORF. In all other described subtypes, it has one or two in-frame STOP codons. In ST4, position 9 is TGG(Trp) or TAT(Tyr). Position 11 is TTG(Leu). In all other strains, positions 9 and 11 are TAG STOP codons. In ST8 DMP/08-128, position 2 is a TGA STOP codon. Most other STOP codons in the Blastocystis mitochondrial genomes are TAA. No alternative START codons (CTG or TTG) are found after position 9

No clear homologs

It has extremely low sequence similarity to anything in the public sequence databases, and no BLAST hits outside of Blastocystis were identified. Even within Blastocystis, BLAST hits only have ~27.5% sequence similarity at the amino acid level. The %GC and amino acid composition are somewhat reminiscent of some ribosomal proteins

It is not a pseudogene

It is unlikely to be a pseudogene:

If the in-frame STOP codon was due to pseudogenization, it was a pseudogene already in the Blastocystis ancestor. Given the age of the pseudogene, we would expect to see more in-frame STOP codons
There are no other in-frame STOP codons
The ORF is still pretty long, 480 nucleotides
dN/dS ratio < 1, indicating negative selection

orf160 (negative strand) overlaps with its upstream neighbor, nad7 (negative strand)

Jacob et al identified a 55 or 56 bp overlap on the 5' end of orf160. This means that the first 19 codons / aa’s of orf160 overlap with the 3' end of nad7. We checked this overlap with nad7 in the Blastocystis NandII strain.
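The codon count follows directly from the overlap length. Here is a minimal sketch of that arithmetic; the function names are illustrative and not from Jacob et al. It also checks whether codons 9 and 11 (the in-frame STOP positions described above) fall inside the overlap.

```python
import math

def codons_touched(overlap_bp: int) -> int:
    """Number of 5' codons that are at least partly covered by the overlap."""
    return math.ceil(overlap_bp / 3)

def codon_in_overlap(codon_number: int, overlap_bp: int) -> bool:
    """True if the whole codon (1-based) lies within the first overlap_bp nucleotides."""
    return codon_number * 3 <= overlap_bp

for overlap in (55, 56):
    print(overlap,
          codons_touched(overlap),
          codon_in_overlap(9, overlap),
          codon_in_overlap(11, overlap))
```

With a 55 or 56 bp overlap, 18 codons are fully covered and codon 19 is partly covered, so both in-frame STOP positions sit inside the region shared with nad7.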

No evidence as of yet for transcription of this gene

Jacob et al tried RT-PCR, without success. The Roger lab, with Eleni Gentekaki, also found no EST or RNAseq evidence.

Hypothesis 1: TAG has been re-assigned to a sense codon

If this is true, all mitochondrial genes should end with TAA or TGA, which according to Jacob et al is not always true. If this is true, we should also be able to find a tRNA with an anticodon able to recognize TAG. Jacob et al were unable to find any in the ST1 nuclear genome.

Hypothesis 2: TAG in position 9 is RNA edited to a sense codon

Unable to check this hypothesis because no RNA data available

Hypothesis 3: Translational read-through circumvents translation of TAG

Hypothesis 4: Alternative start codon? Other than CTG or TTG

Joran's work in June 2025

orf160 encodes for mitochondrial ribosomal subunit RPL10

I took the protein sequence of one of the only two proper orf160 copies, from ST4 DMP/10-212 (APC25055.1), and threw it into online AlphaFold. It gave me a decent structure, which I threw into Foldseek. In AFDB-SwissProt, the top hits were to 39S ribosomal protein L10, mitochondrial, among Eukaryotes, and 50S ribosomal protein L10 among Bacteria. Apparently, RPL10 is either absent or unrecognized in many eukaryotic lineages (Ryo Harada et al 2025, Journal of Eukaryotic Microbiology). In all of those hits, the first 21-25 or so amino acids of the N-terminal region did not align in the Foldseek alignment. The unaligned N-terminal part of the query also seems to stick out of the structure a bit.

ST4 DMP/10-212 orf160 does not seem to have any targeting signals

I tried DeepLoc and SignalP; neither found one.

orf160 in Blastocystis genomes ST7C, E, G, H and B also has the in-frame STOP codons

As a query, I used the publicly available orf160 copy of ST7B from CU914152.1, which is the complete nucleotide sequence of the ST7B mitochondrial genome. The relevant orf160 sequence (coordinates 13813-14292), with the in-frame stop codon annotated:

>CU914152_13813-14292 CU914152
  .  .  .  .  .  .  .  .TAG.
ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA
AAAATTGTACGTAAATATAATTATATTTTAAAATTTAATCAATATTTTAAAAATTATAAA
TATATGTTATTTTGTGATAATACTAGTTTAAATTTGAATTTATATAAACATGAAATTTTG
TTAAATCCTAATGTTAAATGTATTTTCTTAAAAAAATTTAAATGTATTGATAATTTAACA
TATTTTAATTCTAATTTAAAAAATTCAACAGTAATTTTTTGTACTAATGATTTACAAACT
TTATATTTAATAATTTCTAAATTACAAACTAATATATTATTTTGTAAAATACAAAATAAT
TATTATTCTTTAAAAAATTTAAATACTTATATTAATTCTATATATGGATTAGTTAATTAT
TTAGATAATTATATGAGTAATTTTATATTTTTATTTCAACAAATTTCTAAAAAACAATAA
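The in-frame STOP in the sequence above can also be located programmatically. Below is a minimal sketch, assuming the standard set of stop codons (TAA, TAG, TGA) and reading frame 0 from the annotated ATG. The function name is illustrative, and only the first 60 nt of the CU914152 excerpt are used as input.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code stops

def inframe_stops(nt_seq: str, frame: int = 0):
    """Return (codon_number, codon) for every stop codon in the given frame.

    Codon numbers are 1-based and counted from the first codon of the frame.
    """
    seq = "".join(nt_seq.upper().split())  # tolerate line breaks in pasted FASTA
    hits = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            hits.append(((i - frame) // 3 + 1, codon))
    return hits

# First line (60 nt, codons 1-20) of the CU914152_13813-14292 excerpt
first_60_nt = "ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA"
print(inframe_stops(first_60_nt))  # [(9, 'TAG')]
```

Running the same function on a full-length copy should report the internal stop(s) plus the terminal stop codon, which makes it easy to spot copies where a frameshift has introduced extra stops.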

I found that strains B, E (Seq_114_MRO), and G had perfect matches. C, E (Seq_115_MRO) and H had some frameshifts, which introduced many other STOP codons. The frameshifts always occurred in this area TGTATTTTCTTAAAAAAATTT, which is a bit downstream of the in-frame STOP codon

Unsurprisingly, the errors were in the long homopolymer A region. Copies with exactly seven A's had no frameshift. The C and H frameshifts could be explained by persistent, unpolished sequencing errors:

Canu sometimes struggles with circular genomes, and the MRO contigs were 1.5x - 2x too big. The ‘superfluous’ regions had for some reason poor Illumina sequencing coverage, and were therefore poorly polished. Unaware of this (totally understandable), Greg had cut the ‘good’ parts and kept the ‘bad’ parts. orf160 was found in both ‘good’ and ‘bad’ parts, and Greg had thus kept the ‘bad’ orf160.

After correcting for this by recutting the MRO contigs, the orf160 copies had perfect matches to the CU914152 copy. The E (Seq_114_MRO) frameshift is probably also a persistent, unpolished sequencing error. However, the superfluous area of Seq_114_MRO did not have an extra orf160 copy, so no ‘good’ copy existed. Yet, if we look at the Illumina coverage of orf160, we can see that the homopolymer area in question was not well covered. Hence, it's likely a remaining sequencing error.
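A quick way to screen orf160 copies for this homopolymer problem is to measure the length of the A run between its flanking bases. This is a minimal sketch: the flanks (TTCTT and TTT) are taken from the motif quoted above, the expected run length of seven comes from the CU914152 copy, and the function names are illustrative.

```python
import re

# A run flanked as in the motif TGTATTTTCTT[A...]TTT
FLANKED_A_RUN = re.compile(r"TTCTT(A+)TTT")
EXPECTED_A_COUNT = 7  # run length in the CU914152 reference copy

def a_run_length(seq: str):
    """Length of the homopolymer A run, or None if the flanks are not found."""
    match = FLANKED_A_RUN.search(seq.upper())
    return len(match.group(1)) if match else None

def frame_offset(seq: str):
    """Frame offset (0, 1 or 2) relative to the reference run; 0 means in frame."""
    n = a_run_length(seq)
    return None if n is None else (n - EXPECTED_A_COUNT) % 3

reference_motif = "TGTATTTTCTT" + "A" * 7 + "TTT"
one_a_missing = "TGTATTTTCTT" + "A" * 6 + "TTT"
print(a_run_length(reference_motif), frame_offset(reference_motif))
print(a_run_length(one_a_missing), frame_offset(one_a_missing))
```

The search is best applied to a short window around the motif, since the same flanks could in principle occur elsewhere in an A+T-rich genome.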

The TAG codon is probably not reassigned to a sense codon

No tRNA genes with anticodon able to recognize TAG

The perfect anticodon to TAG would be CTA, or more specifically CUA in RNA terms:

3'-AUC-5' anticodon in tRNA
   |||
5'-UAG-3' codon in mRNA

However, TTA (UUA in RNA terms) may also be possible, via G-U wobble pairing at the third codon position:

3'-AUU-5' anticodon in tRNA
   |||
5'-UAG-3' codon in mRNA

But a UUA anticodon would fit much better with TAA, a STOP codon. So it is unlikely that there is actually a UUA/TTA tRNA:

3'-AUU-5' anticodon in tRNA
   |||
5'-UAA-3' codon in mRNA
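The pairings drawn above can be reproduced with a small helper. This is a minimal sketch under simplified wobble rules (assumption: an unmodified first anticodon base, where G reads C or U, U reads A or G, C reads G, and A reads U); real tRNAs often carry base modifications that change this. Function names are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
# Simplified wobble rules: first anticodon base -> allowed third codon bases
WOBBLE = {"A": "U", "C": "G", "G": "CU", "U": "AG"}

def to_rna(seq: str) -> str:
    return seq.upper().replace("T", "U")

def perfect_anticodon(codon: str) -> str:
    """Watson-Crick anticodon (written 5'->3') for a codon (written 5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(to_rna(codon)))

def codons_read_by(anticodon: str, wobble: bool = True):
    """Codons (5'->3') that an anticodon (5'->3') can pair with."""
    a1, a2, a3 = to_rna(anticodon)
    third_bases = WOBBLE[a1] if wobble else COMPLEMENT[a1]
    return sorted(COMPLEMENT[a3] + COMPLEMENT[a2] + third for third in third_bases)

print(perfect_anticodon("TAG"))  # CUA
print(codons_read_by("CUA"))     # ['UAG']
print(codons_read_by("UUA"))     # ['UAA', 'UAG']
```

Under these rules a UUA anticodon would read both UAA and UAG, which is why such a tRNA would interfere with TAA as a STOP codon.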

I ran tRNAscan-SE on the nuclear and mitochondrial genomes of ST7C (assembled by Greg and me), but could not find any CUA/CTA or UUA/TTA tRNAs. prokka, which runs aragorn under the hood, also did not find any tRNAs with these anticodons in the ST7C mitochondrial genome. Aragorn is a lot more sensitive (and picks up many false positives as well), so if even aragorn can't find such tRNAs, they probably don't exist.

TAG is used as a STOP in at least two other genes

I ran prokka with --addgenes --addmrna --kingdom Mitochondria --cdsrnaolap --gcode 1 on Greg’s ST7C mitochondrial genome. --kingdom Mitochondria ensures it searches mitochondrial databases for functional annotation. --gcode 1 ensures it uses the standard code; prokka runs prodigal under the hood, which uses Genetic Code 11 (Bacteria, Archaea, Plastids) by default.

It found one gene that ends with TGA: ST7C_00011, and three genes that end with TAG: ST7C_00027 (hypothetical), ST7C_00039 (nadj), ST7C_00057 (nad4).

To assess whether these genes were predicted with the correct end codons, I compared with existing annotations of homologs in ST4 DMP/10-212.

ST7C_00011 may be a false positive gene. It is short (43 amino acids), has only uncharacterized bacterial BLAST hits in public databases (there was 1 Blasto hit, but that may also be a mispredicted gene), and is entirely enveloped (on the opposite strand) by ST7C_00010, most likely the 16S rRNA gene.

ST7C_00027 is a homolog of APC25073.1, ribosomal protein S12. ST7C_00027 is length 146, APC25073.1 is length 125. They have the same start, but ST7C_00027 is about 21 amino acids longer. The APC25073.1 stop codon is TAA (see below). Their nucleotide sequences are very similar around the TAA codon in ST4. ST7C seemingly has an insertion relative to ST4, which means the TAA gets skipped until the downstream TAG is found. This could explain why the ST7C copy ends with TAG instead of TAA. The insertion happens in an AAAAAAA stretch, suggesting it could be a homopolymer-type sequencing error. However, upon checking the Illumina DNA read mapping, there do not seem to be any sequencing errors in this area in ST7C. It’s also curious that this is the area where a tRNA is predicted to start on the same strand. You would perhaps not expect this overlap.

 tRNA ST7C                    TCT CTA ATA GTT CAA GGG TTA .... TAGAGAG     # tRNA Asn (gtt)

       G   T   K   R   K   K   S   L   I   V   Q    G    L   E   H   M   T   V   N   H   V   V   I   G   S   S   P   I   *
 ST7C GGT ACA AAA AGA AAA AAA TCT CTA ATA GTT CAA -GG G TTA GAA CAC ATG ACT GTT AAT CAT GTT GTT ATA GGT TCG AGT CCT ATT TAG
      ||  |   ||| | | ||| |   |||  || ||| ||| |||  || i
 ST4  GGA ATG AAA AAA AAA AG- TCT TTA ATA GTT CAA AGG T
       G   M   K   K   K    S    L   *   *   F   K   G
                                   TA A

 tRNA ST4                     TCT TTA ATA GTT CAA AGG TTA .... TAGAGAG     # tRNA Asn (gtt)

ST7C_00039 is a homolog of `APC25066.1`, NADH dehydrogenase subunit 9 (nad9). `ST7C_00039` is length 196, `APC25066.1` is also length 196. They have the same start and end. The `APC25066.1` stop codon is `TAA` (see below). There are no insertions or deletions, and no sequencing errors in ST7C at the STOP codon. It looks like this is simply a mutation causing the difference between the two strains. This suggests the ST7C `TAG` is a true stop codon.

      S   P   K   F   K   S   Y   Y   N   Y   D   N   F   Y   S   F   *
ST7C TCT CCT AAA TTT AAA TCT TAT TAT AAT TAT GAT AAT TTT TAT TCA TTT TAG
     ||. ||. ||| ||| ||| ||| ||| ||| |   | |  || ||| ||| |||  |  ||  ||   
ST4  TCA CCA AAA TTT AAA TCT TAT TAT ATG TTT AAT AAT TTT TAT ACT TTA TAA
      S   P   K   F   K   S   Y   Y   M   F   N   N   F   Y   T   L   *

ST7C_00057 is a homolog of APC25056.1 (nad4). ST7C_00057 is length 73, APC25056.1 is length 487! `ST7C_00057` matches only the C-terminal part of `APC25056.1`, possibly because `ST7C_00057` is the last gene on the linear representation of the circular MRO genome. Prokka calls prodigal with `-p meta` and `-c`; the latter (closed ends, do not allow genes to run off edges) maybe means that genes cannot wrap around from the end to the start of the FASTA. The `APC25056.1` stop codon is `TAA` (see below). `ST7C_00057` and `APC25056.1` end at the same position, suggesting the `TAG` of ST7C is at least at the right spot.

      F   I   I   G   I   Y   P   T   F   I   L   D   Y   L   N   M   S   V   S   F   L   L   N   I   V   S   C   *
ST7C TTT ATT ATA GGT ATT TAT CCT ACT TTC ATT TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT ATA GTA TCT TGT TAG
     |||  |  ||  ||   |  ||| ||| ||| ||   |  ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| |||  ||  |  | | ||| ||
ST4  TTT TTA ATT GGA TTA TAT CCT ACT TTT TTA TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT TTA ATT TGT TGT TAA
      F   L   I   G   L   Y   P   T   F   L   L   D   Y   L   N   M   S   V   S   F   L   L   N   L   I   C   C   *
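If the truncated nad4 prediction is indeed caused by the gene spanning the ends of the linear FASTA, one way to test this would be to re-linearise the circular genome at a different coordinate and rerun the gene prediction. This is a minimal sketch of the rotation step only. The function name is illustrative, and the example sequence is a toy string, not MRO data.

```python
def rotate_circular(seq: str, new_start: int) -> str:
    """Re-linearise a circular sequence so that 0-based position new_start becomes position 0."""
    if not seq:
        return seq
    k = new_start % len(seq)  # allow offsets larger than the sequence or negative
    return seq[k:] + seq[:k]

toy_genome = "AAACCCGGGTTT"  # toy example only
print(rotate_circular(toy_genome, 9))  # TTTAAACCCGGG
```

Choosing a new start inside an intergenic region, or inside a gene that is already well characterised, would keep nad4 away from the sequence ends.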

ST7C nuclear genome encodes a mitochondrial release factor likely to bind UAG

There are broadly speaking two mitochondrial release factors:

mtRF1, which binds UAA and UAG
mtRF2, which binds UAA and UGA

Using human, yeast and annotated Blasto copies, and BLASTP vs all st7 strains, I’ve identified the ST7 Blasto homologs:

mtRF1: ST7C_HKIIKG_7787_gene (Seq23) (reciprocal blast against uniprot returns peptide release factor 1, prfA)
mtRF2: ST7C_TYYQWW_4460_gene (Seq9) (reciprocal blast against uniprot returns peptide release factor 2, prfB)
Both are encoded in the nuclear genome. Targeting peptides?

ST7C_HKIIKG_7787_gene has the glutamine residue Gln181: GVHRV--Q--RVPETESQGRIHTSTMTVAVL (Q = glutamine). Gln181 is critical for recognizing the G in TAG, and its presence suggests that TAG still functions as a STOP codon (Zihala & Elias, 2019). It also does not have a serine at position 206, which in RF2 is associated with being able to recognize both A and G in the second codon position (allowing recognition of TAA and TGA). It also has a T, or Thr186, responsible for discriminating adenine from guanine at the second codon position. This means it should not recognize TGA.

There are hints that orf160 actually starts downstream of the in-frame STOP codons

All Blasto mitochondrial genomes have a 56 or 55 bp overlap between the end of nad7 and the ‘M’ start of orf160, which is really an unusually large overlap. This suggests that perhaps the true start of orf160 is a bit after the ‘M’ start? The first 55/56 bp of the supposed orf160 gene are much more conserved than the remainder of the gene, (see below) which is possibly explained by the fact that this part is for nad7 and not for rpl10 or orf160

The AlphaFold structure’s N-terminus has a helical structure of about 25 aa, but it ‘sticks’ out from the rest of the structure. Perhaps unnatural sticking out?

The AlphaFold + Foldseek hits of the ST4 query invariably leave the first 21-28 or so amino acids of the query unaligned, suggesting that perhaps this N-terminal part of the query is not homologous to RPL10.

AlphaFold3 + Foldseek with an alternative V-14 start also recovers mitochondrial RPL10 hits, suggesting the protein could possibly still function with the V as the start.

Some STs also have genes encoding RPS4 that appear to lack an ATG start codon.

FoldMason alignment of orf160 suggests ATG is not used as a START codon

orf160 sequences within Blastocystis are too divergent to get sensible alignments with mafft. Since Foldseek gave us confident hits that it encodes RPL10, we can infer that the structure is fairly well conserved, even if the primary sequence is not. Hence, ideally, we would like to use structural information to align all orf160 homologs. We can do this with FoldMason, another tool from the same guys that developed Foldseek.

I ran AlphaFold3 on all Jacob et al orf160 predicted amino acid sequences (replacing ‘*’ with ‘X’) of many different subtypes, and used FoldMason MSA to align these predicted structures with predicted structures of the best Foldseek hits (using ST4 homolog as query).

The first 20-30 or so amino acids of the Blasto homologs do indeed not seem to align to the core RPL10 domain. This further suggests that perhaps this part is not actually translated by Blasto.

After those 20-30 amino acids, there was not a single Methionine strictly conserved across all Blasto subtypes. So, if it indeed starts after the TAG STOP codon, and it is indeed not a pseudogene, then it must use an alternative START codon.
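The check for strictly conserved Methionines can be scripted once the FoldMason alignment is loaded as equal-length strings. This is a minimal sketch; the function name, the `after` parameter and the three-sequence alignment are hypothetical illustrations, not the real orf160 alignment.

```python
def strictly_conserved_columns(aligned_seqs, residue="M", after=0):
    """1-based alignment columns (beyond column `after`) where every sequence has `residue`."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [
        col_number
        for col_number, column in enumerate(zip(*aligned_seqs), start=1)
        if col_number > after and all(aa == residue for aa in column)
    ]

# Toy alignment (hypothetical): only column 1 is a fully conserved M
toy_msa = ["MK-LMV", "MKILMA", "MR-LLV"]
print(strictly_conserved_columns(toy_msa))           # [1]
print(strictly_conserved_columns(toy_msa, after=1))  # []
```

Setting `after` to the last column of the unaligned N-terminal region restricts the search to the part that aligns with the core RPL10 domain.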

If so, then what is that START codon?

The search for alternative initiator tRNAs

According to Jacob et al, 2016 and Perez-Brocal & Clark 2008, the MRO genome encodes three methionine tRNA genes: two elongator tRNAs (Me1 and Me2) that are close to each other on the mitochondrial genome, and one initiator tRNA (Mf), located in between the small (12S) and large (16S) subunit rRNA genes.

All of them have the CAU anticodon, which would match with the ATG start codon. So, perhaps there is another initiator tRNA gene, either on the mitochondrial or on the nuclear genome, that could fit with an alternative start in the orf160 gene.

Two Ile (aat anticodon) tRNA genes were possibly missed by previous annotations

Visual inspection of Blasto MRO genome gene structures (as annotated in the literature), revealed that some areas were unannotated, or unaccounted for. In particular, within a block of tRNA genes there seems to be a gap large enough to fit exactly one additional tRNA gene. Also between nad6 and 12S rRNA and between nad2 and nad11 there is such a gap. Perhaps these areas could encode for a tRNA that is an alternative initiator.

The literature annotations for all subtypes are derived from Perez-Brocal and Clark, 2008, who sequenced MRO genomes of Blastocystis DMP_02_328 (ST4) and NandII (ST1). tRNA genes for those two strains were predicted with tRNAscan-SE.

tRNAscan-SE is very specific, but can lack sensitivity in mitochondrial genomes, as its hidden markov models are either from Bacteria, Archaea, Eukaryotic Nuclear, or metazoan/mammalian mitochondrial genomes. Aragorn, on the other hand, is highly sensitive but not very specific. It also seems to be better at detecting mitochondrial-type tRNAs.

I applied aragorn in mitochondrial mode to all Jacob et al blasto MRO genomes:

aragorn -mt -mtd -c -d -e -rp -br -wa -o $base.aragorn.out $fasta
aragorn -mt -mtd -c -d -e -rp -o $base.aragorn.full.out $fasta

I compared with pre-existing annotations. It found loads of false positives, but there were a few interesting hits:

1. A tRNA gene in all 3 ST3 strains between nad6 and 12S rRNA gene, with anticodon `aat` (codon `ATT`):

 
               t
             t-a
             t-a
             t.t
             a-t
             g+t
             t.t
             t-a
             t-a     a
            t   tatat a
   ataa    a    !!!!!  a
  t    aaat     atata  t
  a    !!!!    t     tt
  t    ttta     t
   agat    a     a
            a-tga
            g-c
            a-t
            g+t
            a-t
           t   t
           t   g
            aat

    mtRNA-Ile(aat)
    77 bases, %GC = 9.1
    Sequence [1139,1215]
    Score = 102.678

2. A tRNA in ST6, with aat as anticodon, in between Asn and Leu tRNAs in a tRNA block:

 
            t
          t-a
          a-t
          a-t
          a-t
          a-t
          t.t
         at.t      gaa
         t   tgattt   a
    a   a    :!!!!!   t
   t aac     tctaaa   t
   a !!!    a      ttat
     ttg     a
    a   t     g
         a-tga
         a-t
         a-t
         t.t
         a a
        t   a
        c   g
         aat


 Possible Pseudogene
 mtRNA-Ile(aat)
 73 bases, %GC = 12.3
 Sequence [28432,28504]
 Score = 94.5612

So both possibly new tRNAs have the anticodon aat, which recognizes codon ATT. The original annotation already had an Ile tRNA with anticodon gat, matching codon ATC. So perhaps there is some redundancy here, and perhaps these new hits are pseudogenes? According to theory, aat could only base pair with codon ATT.

Blastocystis imports many tRNAs from the cytosol

The Jacob et al blasto MRO genomes seem to encode tRNAs for only 14 amino acids, and are lacking tRNAs for Ser, Val, Thr, Gln, Arg and Gly. Perez-Brocal & Clark, 2008 had already observed this paucity of tRNA genes and called this “the most dramatic case of tRNA gene loss observed within the stramenopiles”. The mtDNA encoded proteins do include these amino acids, so their tRNAs must be encoded on the nuclear genome, and imported into the mitochondrion. Maybe the alternative initiator tRNA is also imported?

The mitochondrial gene for RPS4 also lacks a START codon in several STs

RPS4 also lacks a typical start codon in ST1, ST2, ST4 and ST8 (Jacob et al 2016). If they use an alternative START, it may be the same START that orf160 is using. RPS4 in Blastocystis seems curiously also twice the size of that in Proteromonas lacertae.

I collected RPS4 homologs from Jacob et al MRO genomes and Proteromonas lacertae, and did an AF3 + FoldMason alignment, and then back transformed it into a codon alignment.

The AF3 predicted structures, even between closely related Blasto subtypes, do not seem to align very well in 3D.

RPS4 of Blasto is indeed about 2x the size of that of Proteromonas. The Proteromonas copy starts with ATG. FoldMason aligned the Proteromonas copy with the N-terminus of the Blastocystis copies. However, it seemingly only aligns well (in the sequence alignment) with the Blasto copies in a particular section: the section that yields good Foldseek hits with public RPS4 homologs in the databases.

I compared the RPS4 codon alignment with that of the RPL10 / orf160 alignment. Codons that seem to be conserved (but not perfectly), in the beginning area of both RPL10 and RPS4 are AAA, AAT.

However, since the sequences don't even align that well, I'm not even sure where the true RPS4 gene starts and ends in Blastocystis and Proteromonas mtDNA.

Little to no expression of orf160 in regular and riboZeroPlus RNAseq data of ST7C

To see whether the mysterious orf160 is actually expressed, I inspected the regular RNAseq, that is, mRNA sequencing via oligo-dT primers/probes, by mapping it on the ST7C genome, that included the MRO genome.

Here is the visual overview:

You can see some RNAseq reads overlapping with ORF160 on its 5' end, and in the proper direction too. However, these may be reads coming from the immediately upstream gene nad7 (in the IGV figure wrongly annotated as ndhH)

Since a regular, polyA-capturing RNAseq by its nature does not capture mitochondrial transcripts, which are not believed to be polyadenylated, this lack of expression may come from a pure lack of captured mitochondrial transcripts in the experimental design.

I therefore tried to resequence the same RNA sample using a different kit, the riboZeroPlus kit. This kit uses a set of custom designed probes to remove rRNA transcripts from the total RNA extract prior to library prep and sequencing.

I used the following probes:

- 18S rRNA probes (one per line):

TTTCATAAACAAACCAAAAAATCGACTATGAAAGCCAATCTTATTATTCC
CAAACACTTTCAATAAATTATCTAAACTTCAACTACGAGCTTTTTAACTG
TTATCCATATAGAAACTATTCCAAATAAACTATAACTGATATAATGAGCC
CTAACAAGCATGCGATAAAGTCAACAATTATTATTACTCACAATTCAATT
TAGCTTTCGTTCTTGATTAATGAAAACATCCTTGGTAAATGCTTTCGCAC
CAGATACTCGTTGAATAGTTCAGTGTCGCGCGCGTGCAGCCCAGAACATC
CTAAAACTATTTAGACTTACACATGCATGGCTTAATCTTTGAGACGAGCG
CCATGGTAGTCCAATACACTACCATCGAAAGCTGATAGGGCAGAAACTTG
GAAAAATTACAAGCATCAATCCCCATCACGAACTATTTTCAAAAGATTTC
AAATCATAGAATTTCACCTCTAGCTATTGAATATGAATACCCCCAACTGT
TCACCTTCCTCTAGATGATAAGATTTACACGACTTCTCTTCAACTATCTA
ATAAGTACTTCTTTAATGGTTGCCCATCAAAGAAAACACATGTATTAGCC
ACTAACTCCTAGTCGGTATCGTTTATAGCTAAGACTACGAGGGTATCTAA
CTATCAATCTGTCAATCCTTCCTATGTCTGGACCTGGTAAGTTTCCCCGT
TCCTTGCGGAACCATGGCACCCACCTGGATGTCGATAACTTACATAAAAG
GATTTATTGTCACTACCTCCCTGTGTCAGGATTGGGTAATTTACGCGCCT
ATAATTAAAAATCCAAAGTGTTCACCGGATCATCCAATCGGTAGGTGCGA
AAGGGCAGGGACGTAATCAACGCAAGTTGATGACTTGCATTTACTAGGAA
CCTGTTATTGCTTCCAGCTTCCCCGTACTCAAACGCACAGTGTCCCTCTA
ACAATGGGGCATTACTAAAATCCCATTTCATCCAACTAATAGGCGGAAGT
AACTGAACAGTCCGCTTTAAACACTCTAATTTTCTCACAGTAAATGACCA
TGTGGTAGCCATCTCTCAGGCTCCCTCTCCGAAATCGAACCCAAATTCTT
ACTCCCCCCGGAACCCAAAGACTTTGATTTCTCATAAGGTACTAATAGAC
TTGTTTATCGATAACGATTGTACATTGTTCTCAATTCAATTACAAAACCA

- 28S rRNA probes (one per line)

CTAACAATGTCTCCCACGTGGGTTGCAACTCGAGAGAGAAGCTTACACAT
AGCCTTTGATGGAGTTTACCACCAACTTCGAGCTGCAATCCCAAACAACT
AAGCCATCACCCCATATTATGGAATAAGTAAAACAACATTAGAGGTAGTG
TCCATGCATCATTCAACCACTCCTACGCTTAACCCCTCCACGATTTCAAG
ATTCAAAATATTGAATTCCTTTACCAATAACAAAACCTTTTCGCGGATTC
GTCGTCTACAAAGGATCTTTGTTCATTGACCATTAAAAATGCTATCAGGG
AGTCCAGCTTACCCGGAATGGCCCACTAGCAACTACTATTCAAAATTACA
AGGCTGTTCGCTTAAGCGCCATCCATTTTCAGGGCTACTTCATTCGGCAG
TTTTCAAAGTGCTTTTCATCTTTCCCTCACGGTACTTGTTCGCTATCGGT
AGCACTGGGCAGAATTCACATTGTGTCAATATATCTTTCACACTATCACA
TTTATCAGAGATGCAAGACCGGTAGTTGTTGCTAGCTCTCTTTAGACAAA
TTTTCTATCCAACTGAGCGAACAATTAGGCGCCGTACCATATCGTTCGGT
AGGTTGACAAATTGCAGAAATAGTTAATAGGGCCGTCCACCTCCCCAGGG
GTTTCAAGACGGGACGGAGAAGCAGTTATTAGGAAAGAGGAAATTCAGTA
AAGCAACTATAATATCTTACCCATTCAAAGTTTGAGAATAGGTCCAGGAT
AAATGTGTTCCCAAAGGGAGGGAAATAATATTACTTTTCAAGGACCCATT
AAGCCGTATCTACTCAAATAGGCTTCTTTATATAGGTCACATCCTTTGGT
CTGCTTCACAAGTACAATACACTATGCAAATACAGGGTTTTCACCTTCTA
GCTACTTCCACCAAGATCTGCACTAATGGACATTCCATATAAGTTTACAC
CATTATTCTATTAACTAGAGGCTATTCACCTTGGAGACCTGATGCGGTTA
GAGAAGAGGTAATAAGGGAAAGGGAATTAATTGATATTTACCAATTTAAC
TACATATTTTAGGAGGGCTTCATGATTAGAGGCTTTCATCACTACGACCC
CGTTCAAAGATTCAATGACTCACAGACTTCTGCAGTTCGCATTACGTATC
TCTCACATTTTACCCAGTCTGCAAGGTATTGGTAGGAAGAGCCGACATCG
AGTTCAACACGATTCCTATGGAACCTTTCTCCACTTCAGTCTTCAAAGAT
CGAGAACCACTGTATTCATATCACTAACCTAGTCAATTGAACTGTTGTCG
TAGTAGACAGACATCCAAGTCAAATCACACTCCAACAAGCATACTCCCAA
AGAGAGTCATAGTTACTCCCGCCGTTTACCCGCGCTTGGTTGAATTCCTT
CATCAATCATCTCATTCATTTGATAACCAAGAACTGACGATCCTATCATT
TCTGTTACCATTCAATTCCATTTCATTGGTTCAGGAATATTAACCTGATT
ACCTTCATTACGCATTTTAGTTTAACACTAAACTACTCGCAAATATGATA
GTTCTAAAAATTCAAAAGAACTTTTTCAACGGATTTCACCTATCTCTTAG
TTTTCCTCTGCTTAGTTAGATGCTTCAATTCAGCAGGTCTTCTTGCTTGA
ATCCAATTCTCATAGTATACTGTTACTAAACAATACTTCTACACTCCACA
CCTAGCCCTCAGAGCCAATCCTTATCCCGAAGTTACGGATCTAATTTGCC
ATTCTATTTCAATGGAGGAAACTCTTAGTCAATCCACCATCAATCATCGT
TTCGTCCTATTCAGGCATAGTTCACCATCTTTCGGGTCCCACCATCTTTG
CCCTTAAAAAGAGTCTCCCACCTATTCTACACCCTCTAAGTCATTTCACA
CATACTGAAAATCAAAATCAAATGAGCTTTTACCCTTTTATTCTACGTAA
TGAGCTCATCTTAGGACACCTGTGTTATTCTTTAACAGATGTGCCGCCCC
GATAAGTCTCAATTTCTCGTTGAACTAAGTCAACTCGAAAACTTACAACC
CCTCTAATCATTCGCTTTACCTCATAAAACTAGACACAGTTGCAGCTATC
GTGTTAATTCGGATTGGGCTTTTCCCACTTCACTCGCCGTTACTAAGGGA
TCCATCACGCCTTCCTACTTGTCACCCCATAATATAACCATCTACTTGAG
CTAGCTTTAAACTCGAAATTCAAATATCTAAAGGATCGATAGGCCATATT
TAAACAGTCGGATTCCCCTTGTCCGTACCAGTTCTGAGTCAGCTATTCAT
CCCAAATTTAAAGATCAATTTGCACGTTAGAATCCACTCGAACCTCCACC
TTTATTATTGTTAACAAGAAAAGAAAACTCTTCCCAGGAGAGTAACCGAT
TACCACCACTAAACAACCACTCCTTTGCATACATTCTTATCATCACAAAC
CAAGCTCAACAGGGTCTTCTTTCCCCGCTGATATTTCCAAGCCCATTCCC

Sequencing was done at the Genomics CORE Lab in the LSRI, with Mat as contact person.

The sequencing run was excellent. Got a lot of data, and it was also really good quality.

After quality trimming the data, I mapped it to the latest version of the ST7C genome with HISAT2.

A lot of reads still mapped to the rRNA genes, but all the other areas still had more than sufficient coverage as well.

Importantly, the mitochondrial genome had many more reads mapping to it:

This is not just the result of a higher throughput. The throughput of the riboZeroPlus run (179 506 076 mapped reads) was about twice that of the original RNAseq run (85 875 153 reads), but far more than twice as many reads now mapped to the MRO genome. (NOTE that for both IGV figures I used the same coverage scale of 2000 in the Coverage track).
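To put a number on "more than the throughput alone explains", the MRO read counts can be normalised by sequencing depth. This is a minimal sketch: the two library totals are the ones quoted above, while the MRO read counts are not recorded in these notes and have to be supplied (for example from `samtools idxstats`). Function and variable names are illustrative.

```python
RIBOZERO_TOTAL = 179_506_076  # riboZeroPlus run, mapped reads (from the notes)
POLYA_TOTAL = 85_875_153      # original RNAseq run, reads (from the notes)

def depth_normalised_fold_change(region_a: int, total_a: int,
                                 region_b: int, total_b: int) -> float:
    """Fold change of the fraction of reads in a region, library A over library B."""
    return (region_a / total_a) / (region_b / total_b)

# Throughput ratio between the two runs
print(round(RIBOZERO_TOTAL / POLYA_TOTAL, 2))  # 2.09
```

A depth-normalised fold change well above 1 for the MRO genome would confirm that the enrichment comes from the library prep and not from the larger run.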

What is striking is that the mitochondrial rRNA genes still had an enormous amount of coverage. Perhaps next time you want to sequence the mitochondrial transcriptome of some organism, also include probes targeting the mitochondrial rRNA genes!

Another striking thing is that it seems that RNAseq coverage of mitochondrial ribosomal genes is much lower than that of the nad genes! Exceptions seem to be rps12 and rpl16.

Zooming in to orf160 / rpl10:

Unfortunately again it seems we are not seeing any significant evidence of expression of this gene. It may be that the throughput for this gene in particular was not high enough to detect any real expression, so we can't rule it out. Any reads that are overlapping with orf160 may be 3'UTR reads from the nad7 gene upstream (here annotated as NdhH)

Ideas to explore

Check RNA expression levels
Try to sequence mitochondrial RNA or specifically orf160 RNA (RT-PCR plus sequencing - Jacob et al designed primers for orf160 and rps4, but were unsuccessful). As of July 2025, we still have (I think) total RNA extracts from ST7C, E and H in the -80 in the main lab (in one of Greg's boxes).
Check for Shine-Dalgarno sequences, possibly after the in-frame start codon?
Andrew: I guess to know if this is ‘significant’ you’d have to look at the density of these kinds of codons throughout the whole sequence. Since mtDNAs like this are A+T-rich, codons that are ‘close’ to ‘ATG’ might not be so rare.
Andrew: One way to test this is to look at the distances from the in-frame stop codon to all ‘near start codons’ in the sequence and add them all up. Then randomly choose the same number of codon positions in that same interval (not allowing the same position to be chosen twice) and calculate the same summed distance. Do the latter step 100 times, and that gives you a distribution of what a uniform placement would look like. If the ‘true’ summed distance is smaller than the random distribution, then it would suggest that ‘near start codons’ are clustered towards the beginning.
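Andrew's permutation idea can be sketched as follows. Assumptions: ‘near start codons’ are taken to be codons exactly one base different from ATG, the interval runs from the codon after the anchor to the end of the supplied sequence, and the anchor is codon 9 (a STOP in most subtypes, TAT in the ST4 copy used here). The input is the first 40 codons of the KU900236 orf160 sequence listed under Useful data. Function names and the fixed seed are illustrative.

```python
import random

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def near_start_positions(codons, anchor, start="ATG"):
    """1-based positions after `anchor` of codons exactly one base away from `start`."""
    return [i for i, codon in enumerate(codons, start=1)
            if i > anchor and hamming(codon, start) == 1]

def clustering_test(codons, anchor, n_perm=100, seed=0):
    """Observed summed distance, the null sums, and the fraction of null sums <= observed."""
    positions = near_start_positions(codons, anchor)
    observed = sum(p - anchor for p in positions)
    candidates = range(anchor + 1, len(codons) + 1)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # rng.sample draws without replacement, so no position is chosen twice
    null = [sum(p - anchor for p in rng.sample(candidates, len(positions)))
            for _ in range(n_perm)]
    fraction = sum(s <= observed for s in null) / n_perm
    return observed, null, fraction

ST4_CODONS = (
    "ATG TTA CAA CTG TTA TTG GTA GTT TAT ATA TTG TTT TTG GTG ATA TTG ATC GAT AAT ATT "
    "AAA TTA AAA TAT TTA CGT CGT AGT TTT AAT TTT AAA AAA ATT TCT ATG TTT GAG CAA TAT"
).split()

observed, null, fraction = clustering_test(ST4_CODONS, anchor=9)
print(near_start_positions(ST4_CODONS, 9))
print(observed, min(null), max(null), fraction)
```

A small fraction means few random placements are as close to the anchor as the real ‘near start codons’, which would point towards clustering at the beginning. For a real analysis the full-length sequences of all subtypes should be used, and probably more than 100 permutations.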

Useful data

ST4 DMP/10-212 orf160 sequence (nucleotide, amino acid, codons)

>KU900236_9694-10176 KU900236
ATGTTACAACTGTTATTGGTAGTTTATATATTGTTTTTGGTGATATTGATCGATAATATT
AAATTAAAATATTTACGTCGTAGTTTTAATTTTAAAAAAATTTCTATGTTTGAGCAATAT
CATTATATTGTAATATGTTCAAACTTACAAATAGTTTCAAATTTAAAACAATTATTACTT
CAATATCCTACAATAAAAATTCAATTTTTTAAAAAATCAAACAGAAATATCTATTTAATT
TTTTTATTACCATATTTAACTAATTCATTAATTTTATTGGGATGTAACGAACTTAATGTT
TTTTTTAAATTGTGTGAATGTGTTTCTAAAAATATTTTGTTTATAAAAGTACAAAATACG
ATTTATTCTATAAATCAATTTATGGATTGTTCTTCAAATCAAATTATGTTTGGACAAACT
TTAAATTCACTTTATTATAATTTGATTAAAGTTTTTTATTCTTTTTCTTTATTACATAAA
TAA

>KU900236_9694-10176 KU900236
MLQLLLVVYILFLVILIDNIKLKYLRRSFNFKKISMFEQYHYIVICSNLQIVSNLKQLLL
QYPTIKIQFFKKSNRNIYLIFLLPYLTNSLILLGCNELNVFFKLCECVSKNILFIKVQNT
IYSINQFMDCSSNQIMFGQTLNSLYYNLIKVFYSFSLLHK*

$$$, &&&, ^^^ and %%% and ### are codons 1 base different from ATG

>KU900236_9694-10176
ATG TTA CAA CTG TTA TTG GTA GTT TAT ATA TTG TTT TTG GTG ATA TTG ATC GAT AAT ATT
 M   L   Q   L   L   L   V   V   Y   I   L   F   L   V   I   L   I   D   N   I
 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
                                 *  $$$ &&&     &&& ^^^ $$$ &&& %%%         ###

AAA TTA AAA TAT TTA CGT CGT AGT TTT AAT TTT AAA AAA ATT TCT ATG TTT GAG CAA TAT
 K   L   K   Y   L   R   R   S   F   N   F   K   K   I   S   M   F   E   Q   Y
                                                            ^^^

CAT TAT ATT GTA ATA TGT TCA AAC TTA CAA ATA GTT TCA AAT TTA AAA CAA TTA TTA CTT
 H   Y   I   V   I   C   S   N   L   Q   I   V   S   N   L   K   Q   L   L   L

CAA TAT CCT ACA ATA AAA ATT CAA TTT TTT AAA AAA TCA AAC AGA AAT ATC TAT TTA ATT
 Q   Y   P   T   I   K   I   Q   F   F   K   K   S   N   R   N   I   Y   L   I

TTT TTA TTA CCA TAT TTA ACT AAT TCA TTA ATT TTA TTG GGA TGT AAC GAA CTT AAT GTT
 F   L   L   P   Y   L   T   N   S   L   I   L   L   G   C   N   E   L   N   V

TTT TTT AAA TTG TGT GAA TGT GTT TCT AAA AAT ATT TTG TTT ATA AAA GTA CAA AAT ACG
 F   F   K   L   C   E   C   V   S   K   N   I   L   F   I   K   V   Q   N   T

ATT TAT TCT ATA AAT CAA TTT ATG GAT TGT TCT TCA AAT CAA ATT ATG TTT GGA CAA ACT
 I   Y   S   I   N   Q   F   M   D   C   S   S   N   Q   I   M   F   G   Q   T

TTA AAT TCA CTT TAT TAT AAT TTG ATT AAA GTT TTT TAT TCT TTT TCT TTA TTA CAT AAA
 L   N   S   L   Y   Y   N   L   I   K   V   F   Y   S   F   S   L   L   H   K

TAA
 *
