===== orf160 in Blastocystis mitochondrial genomes =====

Knowledge from the Literature: Jacob et al, 2016, GBE

=== Basic properties ===

//orf160// is an enigmatic gene found in //Blastocystis// mitochondrial genomes. It is named as such because, even though it's not technically an ORF (see below), it looks like one, and it is between 158 and 162 amino acids long (about 480 nt).

=== In-frame STOP codons ===

Only in ST4 DMP/02-328 and ST4 DMP/10-212 is this a //bona fide// ORF. In all other described subtypes, it has one or two //in-frame// STOP codons. In ST4, position 9 is ''%%TGG(Trp)%%'' or ''%%TAT(Tyr)%%''. Position 11 is ''%%TTG(Leu)%%''. In all other strains, positions 9 and 11 are ''%%TAG%%'' STOP codons. In ST8 DMP/08-128, position 2 is a ''%%TGA%%'' STOP codon. Most other STOP codons in the //Blastocystis// mitochondrial genomes are ''%%TAA%%''. No alternative START codons (''%%CTG%%'' or ''%%TTG%%'') are found after position 9

=== No clear homologs ===

It has extremely low sequence similarity to anything in the public sequence databases, and no BLAST hits outside of //Blastocystis// were identified. Even within //Blastocystis//, BLAST hits only have ~27.5% sequence similarity at the amino acid level. The %GC and amino acid composition are somewhat reminiscent of some ribosomal proteins

=== It is not a pseudogene ===

It is unlikely to be a **pseudogene**: 
  * If the in-frame STOP codon was due to pseudogenization, it was a pseudogene already in the //Blastocystis// ancestor. Given the age of the pseudogene, we would expect to see more in-frame STOP codons
  * There are no other in-frame STOP codons
  * The ORF is still pretty long, 480 nucleotides
  * dN/dS ratio < 1, indicating negative selection

=== orf160 (negative strand) overlaps with its upstream neighbor, nad7 (negative strand) ===

Jacob //et al// identified a 55 or 56 bp overlap at the 5' end of //orf160//. This means that the first 19 codons / aa’s of //orf160// overlap with the 3' end of //nad7//. We checked the overlap with //nad7// in the //Blastocystis// NandII strain.

=== No evidence as of yet for transcription of this gene ===
Jacob //et al// tried RT-PCR, without success. The Roger lab, with Eleni Gentekaki, also found no EST or RNAseq evidence.


=== Hypothesis 1: TAG has been re-assigned to a sense codon ===

If this is true, all mitochondrial genes should end with ''%%TAA%%'' or ''%%TGA%%'', which according to Jacob //et al// is not always the case. We should also be able to find a tRNA with an anticodon able to recognize ''%%TAG%%''. Jacob //et al// were unable to find any in the ST1 nuclear genome.

=== Hypothesis 2: TAG in position 9 is RNA edited to a sense codon ===

Unable to check this hypothesis because no RNA data available

=== Hypothesis 3: Translational read-through circumvents translation of TAG ===

=== Hypothesis 4: Alternative start codon? Other than CTG or TTG ===


==== Joran's work in June 2025 ====

=== orf160 encodes for mitochondrial ribosomal subunit RPL10 ===

I took the protein sequence of one of the only two proper //orf160// copies, from ST4 DMP/10-212 (''%%APC25055.1%%''), and threw it into online AlphaFold. It gave me a decent structure, which I then threw into Foldseek. Among the ''%%AFDB-SwissProt%%'' hits, the top ones were ''%%39S ribosomal protein L10, mitochondrial%%'' among Eukaryotes, and ''%%50S ribosomal protein L10%%'' among Bacteria. Apparently, RPL10 is either absent or unrecognized in many eukaryotic lineages (Ryo Harada //et al// 2025, Journal of Eukaryotic Microbiology). In all of those hits, the first 21-25 or so amino acids of the N-terminal region did not align in the Foldseek alignment. The unaligned N-terminal part of the query also seems to stick out of the structure a bit.

=== ST4 DMP/10-212 orf160 does not seem to have any targeting signals ===

I tried DeepLoc and SignalP; neither predicted a targeting signal.

=== orf160 in Blastocystis genomes ST7c,e,g,h,b also have the in-frame STOP codons ===

As a query, I used the publicly available //orf160// copy of ST7B from ''%%CU914152.1%%'', the complete nucleotide sequence of the ST7B mitochondrial genome. The relevant //orf160// sequence (coordinates 13813-14292), with the in-frame stop codon annotated:

<code>
>CU914152_13813-14292 CU914152
  .  .  .  .  .  .  .  .TAG.
ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA
AAAATTGTACGTAAATATAATTATATTTTAAAATTTAATCAATATTTTAAAAATTATAAA
TATATGTTATTTTGTGATAATACTAGTTTAAATTTGAATTTATATAAACATGAAATTTTG
TTAAATCCTAATGTTAAATGTATTTTCTTAAAAAAATTTAAATGTATTGATAATTTAACA
TATTTTAATTCTAATTTAAAAAATTCAACAGTAATTTTTTGTACTAATGATTTACAAACT
TTATATTTAATAATTTCTAAATTACAAACTAATATATTATTTTGTAAAATACAAAATAAT
TATTATTCTTTAAAAAATTTAAATACTTATATTAATTCTATATATGGATTAGTTAATTAT
TTAGATAATTATATGAGTAATTTTATATTTTTATTTCAACAAATTTCTAAAAAACAATAA
</code>
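
The in-frame STOP check can be reproduced with a few lines of Python. This is only a minimal sketch, not the procedure actually used: the function name ''%%find_inframe_stops%%'' is made up for illustration, the standard genetic code is assumed, and only the first 60 nt of the ST7B and ST4 copies listed on this page are used.

<code python>
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_inframe_stops(nt_seq, stops=STOP_CODONS):
    """Return (1-based codon position, codon) for each in-frame stop.

    The last codon is skipped, since in a complete ORF it is the
    regular terminal STOP.
    """
    seq = nt_seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [(pos, codon)
            for pos, codon in enumerate(codons[:-1], start=1)
            if codon in stops]


# First 60 nt of the ST7B copy (CU914152, 13813-14292)
st7b_start = "ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA"
# First 60 nt of the ST4 DMP/10-212 copy (KU900236, 9694-10176)
st4_start = "ATGTTACAACTGTTATTGGTAGTTTATATATTGTTTTTGGTGATATTGATCGATAATATT"

print(find_inframe_stops(st7b_start))
print(find_inframe_stops(st4_start))
</code>

Running the same function on the full-length sequence of each strain would list every in-frame STOP, which makes frameshift-induced STOP codons easy to spot.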

I found that strains B, E (Seq_114_MRO), and G had perfect matches. C, E (Seq_115_MRO) and H had some frameshifts, which introduced many other STOP codons. The frameshifts always occurred in this area ''%%TGTATTTTCTTAAAAAAATTT%%'', which is a bit downstream of the in-frame STOP codon

Unsurprisingly, the errors were in the long homopolymer ''%%A%%'' region. Copies with exactly 7 ''%%A%%''s had no frameshifts. The C and H frameshifts could be explained by __persistent, unpolished sequencing errors__:

Canu sometimes struggles with circular genomes, and the MRO contigs were 1.5x to 2x too big. The ‘superfluous’ regions had, for some reason, poor Illumina sequencing coverage, and were therefore poorly polished. Unaware of this (totally understandable), Greg had cut out the ‘good’ parts and kept the ‘bad’ parts. //orf160// was found in both ‘good’ and ‘bad’ parts, and Greg had thus kept the ‘bad’ //orf160//.

After correcting for this by recutting the MRO contigs, the //orf160// copies now had perfect matches to the ''%%CU914152%%'' copy. The E (Seq_114_MRO) frameshift is also probably a persistent, unpolished sequencing error. However, the superfluous area of Seq_114_MRO did not have an extra //orf160// copy, so no ‘good’ copy existed. Yet, if we look at the Illumina coverage of //orf160//, we can see that the homopolymer area in question was not well covered. Hence, it's likely a remaining sequencing error.
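
The homopolymer check can be sketched as below. This is an illustration only: the helper names ''%%longest_run%%'' and ''%%shifts_frame%%'' are hypothetical, and the 8-''%%A%%'' variant is an invented example of a sequencing error, not a sequence taken from any of the assemblies.

<code python>
import re


def longest_run(seq, base="A"):
    """Length of the longest uninterrupted run of `base` in `seq`."""
    runs = re.findall(re.escape(base.upper()) + "+", seq.upper())
    return max((len(run) for run in runs), default=0)


def shifts_frame(observed_run, expected_run=7):
    """True if the run-length difference is not a multiple of 3."""
    return (observed_run - expected_run) % 3 != 0


# Region around the homopolymer, with the 7 A's of the reference copy
reference = "TGTATTTTCTT" + "A" * 7 + "TTT"
# Hypothetical erroneous copy with one extra A
variant = "TGTATTTTCTT" + "A" * 8 + "TTT"

for name, region in (("reference", reference), ("variant", variant)):
    run = longest_run(region)
    print(name, run, "frameshift" if shifts_frame(run) else "in frame")
</code>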

==== The TAG codon is probably not reassigned to a sense codon ====

=== No tRNA genes with anticodon able to recognize TAG ===

The perfect anticodon to ''%%TAG%%'' would be ''%%CTA%%'', or more specifically ''%%CUA%%'' in RNA terms:

<code>
3'-AUC-5' anticodon in tRNA
   |||
5'-UAG-3' codon in mRNA
</code>

However, ''%%TTA%%'' (''%%UUA%%'' in RNA terms) may also be possible, via a G-U wobble pair at the third codon position:

<code>
3'-AUU-5' anticodon in tRNA
   |||
5'-UAG-3' codon in mRNA
</code>

But a ''%%UUA%%'' anticodon would fit much better with ''%%TAA%%'', a STOP codon. So it is unlikely that there is actually a ''%%UUA%%''/''%%TTA%%'' tRNA:

<code>
3'-AUU-5' anticodon in tRNA
   |||
5'-UAA-3' codon in mRNA
</code>

I ran ''%%tRNAscan-SE%%'' on the nuclear and mitochondrial genomes of ST7C (assembled by Greg and me), but could not find any ''%%CUA%%''/''%%CTA%%'' or ''%%UUA%%''/''%%TTA%%'' tRNAs. ''%%prokka%%'', which runs ''%%aragorn%%'' under the hood, also did not find any tRNAs with these anticodons in the ST7C mitochondrial genome. Aragorn is a lot more sensitive (and picks up many false positives as well), so if even aragorn can't find such tRNAs, they probably don't exist.
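
For reference, the perfectly matching (Watson-Crick) anticodon of any codon is simply its reverse complement. The sketch below shows this; the function names are made up for illustration and wobble pairing is deliberately ignored.

<code python>
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def perfect_anticodon(codon):
    """Watson-Crick anticodon of a DNA codon, written 5'->3' in DNA letters."""
    return codon.upper().translate(COMPLEMENT)[::-1]


def to_rna(seq):
    """Rewrite a DNA string in RNA letters (T -> U)."""
    return seq.upper().replace("T", "U")


for codon in ("TAG", "TAA", "ATG"):
    anticodon = perfect_anticodon(codon)
    print(codon, "->", anticodon, "/", to_rna(anticodon))
</code>

The same function can be used in the other direction, since the codon matching an anticodon is also its reverse complement.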


=== TAG is used as a STOP in at least two other genes ===

I ran ''%%prokka%%'' with ''%%--addgenes --addmrna --kingdom Mitochondria --cdsrnaolap --gcode 1%%'' on Greg’s ST7C mitochondrial genome. ''%%--kingdom Mitochondria%%'' ensures it searches mitochondrial databases for functional annotation. ''%%--gcode 1%%'' ensures it uses the standard code: ''%%prokka%%'' runs ''%%prodigal%%'' under the hood, which uses Genetic Code 11 (Bacteria, Archaea, Plastids) by default.

It found one gene that ends with ''%%TGA%%'': ''%%ST7C_00011%%'' and three genes that end with ''%%TAG%%'': ''%%ST7C_00027%%'' (hypothetical), ''%%ST7C_00039%%'' (nadj), ''%%ST7C_00057%%'' (nad4).
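
Tallying terminal codons over a set of predicted CDSs can be sketched as follows. This is a minimal illustration, not what ''%%prokka%%'' does internally: the function name is hypothetical, and the input contains only the last three codons of each ''%%TAG%%''-ending gene, copied from the ST7C alignments on this page.

<code python>
from collections import Counter


def stop_codon_usage(cds_by_id):
    """Count the terminal codon of each CDS (nucleotide strings, 5'->3')."""
    return Counter(seq.upper()[-3:] for seq in cds_by_id.values())


# 3' ends only (last three codons) of the ST7C predictions
st7c_cds_ends = {
    "ST7C_00027": "CCTATTTAG",
    "ST7C_00039": "TCATTTTAG",
    "ST7C_00057": "TCTTGTTAG",
}

print(stop_codon_usage(st7c_cds_ends))
</code>

With full-length CDS sequences of all predicted genes as input, the same tally would show how often ''%%TAA%%'', ''%%TAG%%'' and ''%%TGA%%'' are used genome-wide.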
  
To assess whether these genes were predicted with the correct end codons, I compared with existing annotations of homologs in ''ST4 DMP/10-212''.

**ST7C_00011** may be a false positive gene. It is short (43 amino acids) and has only bacterial, uncharacterized BLAST hits in public databases (there was 1 //Blasto// hit, but that may also be a mispredicted gene). It is also entirely enveloped (on the opposite strand) by ''%%ST7C_00010%%'', most likely the 16S rRNA gene.

**ST7C_00027** is a homolog of ''%%APC25073.1%%'', ribosomal protein S12. ''%%ST7C_00027%%'' is length 146, ''%%APC25073.1%%'' is length 125. They have the same start, but ''%%ST7C_00027%%'' is about 21 amino acids longer. The ''%%APC25073%%'' stop codon is ''%%TAA%%'' (see below). Their nucleotide sequences are very similar around the ''%%TAA%%'' codon in ST4. ST7C seemingly has an insertion relative to ST4, which means the ''%%TAA%%'' is no longer read in frame, and translation continues until the downstream ''%%TAG%%''. __This could explain why the ST7C copy ends with ''TAG'' instead of ''TAA''__. The insertion occurs in an ''%%AAAAAAA%%'' stretch, suggesting it could be a homopolymer-type sequencing error. However, the DNA Illumina mapping does not show any sequencing errors in this area in ST7C. Curiously, this is also the area where a tRNA is predicted to start on the same strand. You would perhaps not expect this overlap.

<code>
 tRNA ST7C                    TCT CTA ATA GTT CAA GGG TTA .... TAGAGAG     # tRNA Asn (gtt)

       G   T   K   R   K   K   S   L   I   V   Q    G    L   E   H   M   T   V   N   H   V   V   I   G   S   S   P   I   *
 ST7C GGT ACA AAA AGA AAA AAA TCT CTA ATA GTT CAA -GG G TTA GAA CAC ATG ACT GTT AAT CAT GTT GTT ATA GGT TCG AGT CCT ATT TAG
      ||  |   ||| | | ||| |   |||  || ||| ||| |||  || i
 ST4  GGA ATG AAA AAA AAA AG- TCT TTA ATA GTT CAA AGG T
       G   M   K   K   K    S    L   *   *   F   K   G
                                   TA A

 tRNA ST4                     TCT TTA ATA GTT CAA AGG TTA .... TAGAGAG     # tRNA Asn (gtt)
</code>


**ST7C_00039** is a homolog of ''%%APC25066.1%%'', NADH dehydrogenase subunit 9 (nad9). ''%%ST7C_00039%%'' is length 196, ''%%APC25066.1%%'' is also length 196. They have the same start and end. The ''%%APC25066.1%%'' stop codon is ''%%TAA%%'' (see below). There are no insertions or deletions, and no sequencing errors in ST7C at the STOP codon. It looks like this is simply a mutation causing the difference between the two strains. __This suggests the ST7C ''TAG'' is a true stop codon__.


<code>
      S   P   K   F   K   S   Y   Y   N   Y   D   N   F   Y   S   F   *
ST7C TCT CCT AAA TTT AAA TCT TAT TAT AAT TAT GAT AAT TTT TAT TCA TTT TAG
     ||. ||. ||| ||| ||| ||| ||| ||| |   | |  || ||| ||| |||  |  ||  ||   
ST4  TCA CCA AAA TTT AAA TCT TAT TAT ATG TTT AAT AAT TTT TAT ACT TTA TAA
      S   P   K   F   K   S   Y   Y   M   F   N   N   F   Y   T   L   *
</code>


**ST7C_00057** is a homolog of ''%%APC25056.1%%'' (nad4). ''%%ST7C_00057%%'' is length 73, ''%%APC25056.1%%'' is length 487! ''%%ST7C_00057%%'' matches only the C-terminal part of ''%%APC25056.1%%'', possibly because ''%%ST7C_00057%%'' is the last gene on the linear representation of the circular MRO genome. Prokka calls prodigal with ''%%meta%%'' and ''%%-c%%'', which maybe (I am unsure what is meant by genes "running off edges") means that genes cannot wrap around from the end to the start of the FASTA. The ''%%APC25056.1%%'' stop codon is ''%%TAA%%'' (see below). ''%%ST7C_00057%%'' and ''%%APC25056%%'' end at the same position, suggesting the ''%%TAG%%'' of ST7C is at least in the right spot.


<code>
      F   I   I   G   I   Y   P   T   F   I   L   D   Y   L   N   M   S   V   S   F   L   L   N   I   V   S   C   *
ST7C TTT ATT ATA GGT ATT TAT CCT ACT TTC ATT TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT ATA GTA TCT TGT TAG
     |||  |  ||  ||   |  ||| ||| ||| ||   |  ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| |||  ||  |  | | ||| ||
ST4  TTT TTA ATT GGA TTA TAT CCT ACT TTT TTA TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT TTA ATT TGT TGT TAA
      F   L   I   G   L   Y   P   T   F   L   L   D   Y   L   N   M   S   V   S   F   L   L   N   L   I   C   C   *
</code>


=== ST7C nuclear genome encodes a mitochondrial release factor likely to bind UAG ===

There are broadly speaking two mitochondrial release factors:
    * ''%%mtRF1%%'', which binds ''%%UAA%%'' and ''%%UAG%%''
    * ''%%mtRF2%%'', which binds ''%%UAA%%'' and ''%%UGA%%''

Using human, yeast and annotated //Blasto// copies as BLASTP queries against all ST7 strains, I’ve identified the ST7 //Blasto// homologs:
    * ''%%mtRF1%%'': ''%%ST7C_HKIIKG_7787_gene%%'' (''%%Seq23%%'') (reciprocal blast against uniprot returns peptide release factor 1, ''%%prfA%%'')
    * ''%%mtRF2%%'': ''%%ST7C_TYYQWW_4460_gene%%'' (''%%Seq9%%'') (reciprocal blast against uniprot returns peptide release factor 2, ''%%prfB%%'')
    * Both are encoded in the nuclear genome. Targeting peptides?

''%%ST7C_HKIIKG_7787_gene%%'' has the glutamine residue Gln181: ''%%GVHRV--Q--RVPETESQGRIHTSTMTVAVL%%'' (Q = glutamine). Gln181 is critical for recognizing the ''%%G%%'' in ''%%TAG%%'', and its presence suggests that ''%%TAG%%'' still functions as a STOP codon (Zihala & Elias, 2019). The protein also lacks a serine at position 206, which in ''%%RF2%%'' is associated with the ability to recognize both ''%%A%%'' and ''%%G%%'' in the second codon position (allowing recognition of ''%%TAA%%'' and ''%%TGA%%''). It also has a ''%%T%%'', or Thr186, responsible for discriminating adenine //against// guanine at the second codon position. This means it should **not** recognize ''%%TGA%%''.


==== There are hints that orf160 actually starts downstream of the in-frame STOP codons ====

  * All //Blasto// mitochondrial genomes have a 56 or 55 bp overlap between the end of //nad7// and the ‘M’ start of //orf160//, which is really __an unusually large overlap__. This suggests that perhaps the true start of //orf160// is a bit after the ‘M’ start? The first 55/56 bp of the supposed //orf160// gene are much more conserved than the remainder of the gene, (see below) which is possibly explained by the fact that this part is for //nad7// and not for //rpl10// or //orf160//

  * The AlphaFold structure’s N-terminus forms __a helix of about 25 aa that ‘sticks’ out from the rest of the structure__. Perhaps this protrusion is unnatural?

  * The AlphaFold + Foldseek hits of the ST4 query invariably __do not align to the first 21-28 or so amino acids__ of the query. This suggests that perhaps this N-terminal part of the query is not homologous to RPL10?

  * AlphaFold3 + Foldseek with __alternative V-14 start also recovers mitochondrial RPL10 hits__, suggesting this could still function possibly with the V as the start


Some STs also have genes encoding RPS4 that appear to lack an ''%%ATG%%'' start codon.


==== FoldMason alignment of orf160 suggests ATG is not used as a START codon ====

//orf160// sequences within //Blastocystis// are too divergent to get sensible alignments with ''mafft''. Since Foldseek gave us confident hits that it encodes RPL10, we can infer that the structure is fairly well conserved, even if the primary sequence is not. Hence, ideally, we would like to use structural information to align all //orf160// homologs. We can do this with **FoldMason**, another tool from the same guys that developed Foldseek.

I ran AlphaFold3 on all Jacob //et al// //orf160// predicted amino acid sequences (replacing ‘*’ with ‘X’) of many different subtypes, and used FoldMason MSA to align these predicted structures with predicted structures of the best Foldseek hits (using ST4 homolog as query).

The first 20-30 or so amino acids of the //Blasto// homologs do indeed not seem to align to the core RPL10 domain. Further suggesting that perhaps this part is not actually translated by //Blasto//

After those 20-30 amino acids, there was not a single Methionine strictly conserved across all //Blasto// subtypes. So, if it indeed starts //after// the ''TAG'' STOP codon, and it is indeed not a pseudogene, then it must use an alternative START codon.
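
Checking an alignment for strictly conserved Methionines can be sketched as below. The function name is hypothetical, and the three aligned sequences are an invented toy example, not the actual FoldMason output.

<code python>
def conserved_columns(aligned, residue="M", start=0):
    """0-based alignment columns (>= start) where every sequence has `residue`."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [col for col in range(start, length)
            if all(seq[col] == residue for seq in aligned)]


# Hypothetical toy alignment ('-' = gap, 'X' = in-frame STOP replaced by X)
toy_alignment = [
    "MLQX-VKMF",
    "MLPXLVRLF",
    "MLQX-IKMF",
]

print(conserved_columns(toy_alignment))           # includes the annotated start
print(conserved_columns(toy_alignment, start=1))  # only downstream columns
</code>

An empty list for the columns downstream of the in-frame STOP corresponds to the observation above: no Methionine is shared by all subtypes.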

If so, then what is that START codon?


==== The search for alternative initiator tRNAs ====

According to Jacob //et al//, 2016 and Perez-Brocal & Clark, 2008, the MRO genome encodes three tRNA genes with a Met anticodon: two elongator tRNAs (Me1 and Me2), which are close to each other on the mitochondrial genome, and one initiator tRNA (Mf), located in between the small (12S) and large (16S) subunit rRNA genes.

All of them have the CAU anticodon, which would match with the ATG start codon. So, perhaps there is another initiator tRNA gene, either on the mitochondrial or on the nuclear genome, that could fit with an alternative start in the //orf160// gene.

=== Two Ile (aat anticodon) tRNA genes were possibly missed by previous annotations ===

Visual inspection of //Blasto// MRO genome gene structures (as annotated in the literature), revealed that some areas were unannotated, or unaccounted for. In particular, within a block of tRNA genes there seems to be a gap large enough to fit exactly one additional tRNA gene. Also between //nad6// and //12S rRNA// and between //nad2// and //nad11// there is such a gap. Perhaps these areas could encode for a tRNA that is an alternative initiator.

The literature annotations for all subtypes are derived from Perez-Brocal and Clark, 2008, who sequenced MRO genomes of //Blastocystis// DMP_02_328 (ST4) and NandII (ST1). tRNA genes for those two strains were predicted with **tRNAscan-SE**.

tRNAscan-SE is very specific, but can lack sensitivity in mitochondrial genomes, as its covariance models are derived from either Bacteria, Archaea, Eukaryotic Nuclear, or metazoan/mammalian mitochondrial genomes. **Aragorn**, on the other hand, is highly sensitive, but not very specific. It also seems to be better at detecting mitochondrial-type tRNAs.

I applied aragorn in mitochondrial mode to all Jacob //et al// blasto MRO genomes: 

<code>
aragorn -mt -mtd -c -d -e -rp -br -wa -o $base.aragorn.out $fasta
aragorn -mt -mtd -c -d -e -rp -o $base.aragorn.full.out $fasta
</code>

I compared with pre-existing annotations. It found loads of false positives, but there were a few interesting hits:

1. A tRNA gene in all 3 ST3 strains between //nad6// and //12S rRNA gene//, with anticodon ''%%aat%%'' (codon ''%%ATT%%''):


<code>
 
               t
             t-a
             t-a
             t.t
             a-t
             g+t
             t.t
             t-a
             t-a     a
            t   tatat a
   ataa    a    !!!!!  a
  t    aaat     atata  t
  a    !!!!    t     tt
  t    ttta     t
   agat    a     a
            a-tga
            g-c
            a-t
            g+t
            a-t
           t   t
           t   g
            aat

    mtRNA-Ile(aat)
    77 bases, %GC = 9.1
    Sequence [1139,1215]
    Score = 102.678
</code>

2. A tRNA in ST6, with ''%%aat%%'' as anticodon, in between Asn and Leu tRNAs in a tRNA block:

<code>
 
            t
          t-a
          a-t
          a-t
          a-t
          a-t
          t.t
         at.t      gaa
         t   tgattt   a
    a   a    :!!!!!   t
   t aac     tctaaa   t
   a !!!    a      ttat
     ttg     a
    a   t     g
         a-tga
         a-t
         a-t
         t.t
         a a
        t   a
        c   g
         aat


 Possible Pseudogene
 mtRNA-Ile(aat)
 73 bases, %GC = 12.3
 Sequence [28432,28504]
 Score = 94.5612
</code>

So both possibly new tRNAs have the anticodon ''%%aat%%'', which recognizes codon ''%%ATT%%''. The original annotation already had an Ile tRNA with anticodon ''%%gat%%'', matching codon ''%%ATC%%''. So perhaps there is some redundancy here, and perhaps these new hits are pseudogenes? According to theory, ''%%aat%%'' could only base pair perfectly with codon ''%%ATT%%''.

=== Blastocystis imports many tRNAs from the cytosol ===

The Jacob //et al// blasto MRO genomes seem to encode tRNAs for only 14 amino acids, and lack tRNAs for Ser, Val, Thr, Gln, Arg and Gly. Perez-Brocal & Clark, 2008 had already observed this paucity of tRNA genes and called it "the most dramatic case of tRNA gene loss observed within the stramenopiles". The mtDNA-encoded proteins do include these amino acids, so their tRNAs must be encoded on the nuclear genome and imported into the mitochondrion. Maybe the alternative initiator tRNA is also imported?

=== The mitochondrial gene for RPS4 also lacks a START codon in several STs ===

RPS4 also lacks a typical start codon in ST1, ST2, ST4 and ST8 (Jacob et al 2016). If they use an alternative START, it may be the same START that //orf160// is using. RPS4 in Blastocystis seems curiously also twice the size of that in //Proteromonas lacertae//.

I collected RPS4 homologs from the Jacob //et al// MRO genomes and //Proteromonas lacertae//, did an AF3 + FoldMason alignment, and then back-transformed it into a codon alignment.

The AF3 predicted structures, even between closely related Blasto subtypes, do not seem to align very well in 3D.

RPS4 of //Blasto// is indeed about 2x the size of that of //Proteromonas//.  The //Proteromonas// copy starts with ''%%ATG%%''. FoldMason aligned the //Proteromonas// copy with the N-terminus of the //Blastocystis// copies. However, it seemingly only aligns well (in the sequence alignment) with the Blasto copies in a particular section: the section that yields good Foldseek hits with public RPS4 homologs in the databases.

I compared the RPS4 codon alignment with that of the RPL10 / orf160 alignment. Codons that seem to be conserved (but not perfectly), in the beginning area of both RPL10 and RPS4 are ''%%AAA%%'', ''%%AAT%%''.

However, since the sequences don't even align that well, I'm not even sure where the true RPS4 gene starts and ends in //Blastocystis// and //Proteromonas// mtDNA.

==== Little to no expression of orf160 in regular and riboZeroPlus RNAseq data of ST7C ====

To see whether the mysterious //orf160// is actually expressed, I inspected the regular RNAseq, that is, mRNA sequencing via oligo-dT primers/probes, by mapping it on the ST7C genome, that included the MRO genome.

Here is the visual overview:

{{:overall_mro_expression_on_regularrnaseq.png|}}

You can see some RNAseq reads overlapping with ORF160 on its 5' end, and in the proper direction too. However, these may be reads coming from the immediately upstream gene //nad7// (in the IGV figure wrongly annotated as //ndhH//)

Since regular, polyA-capturing RNAseq by its nature does not capture mitochondrial transcripts, which are not believed to be polyadenylated, this lack of expression may simply reflect a lack of captured mitochondrial transcripts in the experimental design.

I therefore tried to resequence the same RNA sample using a different kit, the riboZeroPlus kit. This kit uses a set of custom designed probes to remove rRNA transcripts from the total RNA extract prior to library prep and sequencing.

I used the following probes:

<code>
- 18S rRNA probes (one per line):

TTTCATAAACAAACCAAAAAATCGACTATGAAAGCCAATCTTATTATTCC
CAAACACTTTCAATAAATTATCTAAACTTCAACTACGAGCTTTTTAACTG
TTATCCATATAGAAACTATTCCAAATAAACTATAACTGATATAATGAGCC
CTAACAAGCATGCGATAAAGTCAACAATTATTATTACTCACAATTCAATT
TAGCTTTCGTTCTTGATTAATGAAAACATCCTTGGTAAATGCTTTCGCAC
CAGATACTCGTTGAATAGTTCAGTGTCGCGCGCGTGCAGCCCAGAACATC
CTAAAACTATTTAGACTTACACATGCATGGCTTAATCTTTGAGACGAGCG
CCATGGTAGTCCAATACACTACCATCGAAAGCTGATAGGGCAGAAACTTG
GAAAAATTACAAGCATCAATCCCCATCACGAACTATTTTCAAAAGATTTC
AAATCATAGAATTTCACCTCTAGCTATTGAATATGAATACCCCCAACTGT
TCACCTTCCTCTAGATGATAAGATTTACACGACTTCTCTTCAACTATCTA
ATAAGTACTTCTTTAATGGTTGCCCATCAAAGAAAACACATGTATTAGCC
ACTAACTCCTAGTCGGTATCGTTTATAGCTAAGACTACGAGGGTATCTAA
CTATCAATCTGTCAATCCTTCCTATGTCTGGACCTGGTAAGTTTCCCCGT
TCCTTGCGGAACCATGGCACCCACCTGGATGTCGATAACTTACATAAAAG
GATTTATTGTCACTACCTCCCTGTGTCAGGATTGGGTAATTTACGCGCCT
ATAATTAAAAATCCAAAGTGTTCACCGGATCATCCAATCGGTAGGTGCGA
AAGGGCAGGGACGTAATCAACGCAAGTTGATGACTTGCATTTACTAGGAA
CCTGTTATTGCTTCCAGCTTCCCCGTACTCAAACGCACAGTGTCCCTCTA
ACAATGGGGCATTACTAAAATCCCATTTCATCCAACTAATAGGCGGAAGT
AACTGAACAGTCCGCTTTAAACACTCTAATTTTCTCACAGTAAATGACCA
TGTGGTAGCCATCTCTCAGGCTCCCTCTCCGAAATCGAACCCAAATTCTT
ACTCCCCCCGGAACCCAAAGACTTTGATTTCTCATAAGGTACTAATAGAC
TTGTTTATCGATAACGATTGTACATTGTTCTCAATTCAATTACAAAACCA

- 28S rRNA probes (one per line)

CTAACAATGTCTCCCACGTGGGTTGCAACTCGAGAGAGAAGCTTACACAT
AGCCTTTGATGGAGTTTACCACCAACTTCGAGCTGCAATCCCAAACAACT
AAGCCATCACCCCATATTATGGAATAAGTAAAACAACATTAGAGGTAGTG
TCCATGCATCATTCAACCACTCCTACGCTTAACCCCTCCACGATTTCAAG
ATTCAAAATATTGAATTCCTTTACCAATAACAAAACCTTTTCGCGGATTC
GTCGTCTACAAAGGATCTTTGTTCATTGACCATTAAAAATGCTATCAGGG
AGTCCAGCTTACCCGGAATGGCCCACTAGCAACTACTATTCAAAATTACA
AGGCTGTTCGCTTAAGCGCCATCCATTTTCAGGGCTACTTCATTCGGCAG
TTTTCAAAGTGCTTTTCATCTTTCCCTCACGGTACTTGTTCGCTATCGGT
AGCACTGGGCAGAATTCACATTGTGTCAATATATCTTTCACACTATCACA
TTTATCAGAGATGCAAGACCGGTAGTTGTTGCTAGCTCTCTTTAGACAAA
TTTTCTATCCAACTGAGCGAACAATTAGGCGCCGTACCATATCGTTCGGT
AGGTTGACAAATTGCAGAAATAGTTAATAGGGCCGTCCACCTCCCCAGGG
GTTTCAAGACGGGACGGAGAAGCAGTTATTAGGAAAGAGGAAATTCAGTA
AAGCAACTATAATATCTTACCCATTCAAAGTTTGAGAATAGGTCCAGGAT
AAATGTGTTCCCAAAGGGAGGGAAATAATATTACTTTTCAAGGACCCATT
AAGCCGTATCTACTCAAATAGGCTTCTTTATATAGGTCACATCCTTTGGT
CTGCTTCACAAGTACAATACACTATGCAAATACAGGGTTTTCACCTTCTA
GCTACTTCCACCAAGATCTGCACTAATGGACATTCCATATAAGTTTACAC
CATTATTCTATTAACTAGAGGCTATTCACCTTGGAGACCTGATGCGGTTA
GAGAAGAGGTAATAAGGGAAAGGGAATTAATTGATATTTACCAATTTAAC
TACATATTTTAGGAGGGCTTCATGATTAGAGGCTTTCATCACTACGACCC
CGTTCAAAGATTCAATGACTCACAGACTTCTGCAGTTCGCATTACGTATC
TCTCACATTTTACCCAGTCTGCAAGGTATTGGTAGGAAGAGCCGACATCG
AGTTCAACACGATTCCTATGGAACCTTTCTCCACTTCAGTCTTCAAAGAT
CGAGAACCACTGTATTCATATCACTAACCTAGTCAATTGAACTGTTGTCG
TAGTAGACAGACATCCAAGTCAAATCACACTCCAACAAGCATACTCCCAA
AGAGAGTCATAGTTACTCCCGCCGTTTACCCGCGCTTGGTTGAATTCCTT
CATCAATCATCTCATTCATTTGATAACCAAGAACTGACGATCCTATCATT
TCTGTTACCATTCAATTCCATTTCATTGGTTCAGGAATATTAACCTGATT
ACCTTCATTACGCATTTTAGTTTAACACTAAACTACTCGCAAATATGATA
GTTCTAAAAATTCAAAAGAACTTTTTCAACGGATTTCACCTATCTCTTAG
TTTTCCTCTGCTTAGTTAGATGCTTCAATTCAGCAGGTCTTCTTGCTTGA
ATCCAATTCTCATAGTATACTGTTACTAAACAATACTTCTACACTCCACA
CCTAGCCCTCAGAGCCAATCCTTATCCCGAAGTTACGGATCTAATTTGCC
ATTCTATTTCAATGGAGGAAACTCTTAGTCAATCCACCATCAATCATCGT
TTCGTCCTATTCAGGCATAGTTCACCATCTTTCGGGTCCCACCATCTTTG
CCCTTAAAAAGAGTCTCCCACCTATTCTACACCCTCTAAGTCATTTCACA
CATACTGAAAATCAAAATCAAATGAGCTTTTACCCTTTTATTCTACGTAA
TGAGCTCATCTTAGGACACCTGTGTTATTCTTTAACAGATGTGCCGCCCC
GATAAGTCTCAATTTCTCGTTGAACTAAGTCAACTCGAAAACTTACAACC
CCTCTAATCATTCGCTTTACCTCATAAAACTAGACACAGTTGCAGCTATC
GTGTTAATTCGGATTGGGCTTTTCCCACTTCACTCGCCGTTACTAAGGGA
TCCATCACGCCTTCCTACTTGTCACCCCATAATATAACCATCTACTTGAG
CTAGCTTTAAACTCGAAATTCAAATATCTAAAGGATCGATAGGCCATATT
TAAACAGTCGGATTCCCCTTGTCCGTACCAGTTCTGAGTCAGCTATTCAT
CCCAAATTTAAAGATCAATTTGCACGTTAGAATCCACTCGAACCTCCACC
TTTATTATTGTTAACAAGAAAAGAAAACTCTTCCCAGGAGAGTAACCGAT
TACCACCACTAAACAACCACTCCTTTGCATACATTCTTATCATCACAAAC
CAAGCTCAACAGGGTCTTCTTTCCCCGCTGATATTTCCAAGCCCATTCCC

</code>

Sequencing was done at the Genomics CORE Lab in the LSRI, with Mat as contact person.

The sequencing run was excellent. Got a lot of data, and it was also really good quality.

After quality trimming the data, I mapped it to the latest version of the ST7C genome with HISAT2.

A lot of reads still mapped to the rRNA genes, but all the other areas still had more than sufficient coverage as well.

Importantly, **the mitochondrial genome had many more reads mapping to it**:

{{:overall_mro_expression_ribozero.png|}}

**This is not just the result of a higher throughput**. The throughput of the riboZeroPlus run (179 506 076 mapped reads) was about twice that of the original RNAseq run (85 875 153 reads), but far more than twice the number of reads now mapped to the MRO genome. (NOTE that for both IGV figures I used the same coverage scale of 2000 in the Coverage track).
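
The throughput comparison can be made explicit by normalizing MRO-mapped reads by the total number of reads per run. In the sketch below, the two totals come from the text above, but the function name is hypothetical and the MRO read counts must be filled in from the actual mappings (they are not recorded on this page).

<code python>
RIBOZERO_TOTAL = 179_506_076  # mapped reads, riboZeroPlus run
POLYA_TOTAL = 85_875_153      # reads, original (polyA) RNAseq run

throughput_ratio = RIBOZERO_TOTAL / POLYA_TOTAL
print(f"throughput ratio: {throughput_ratio:.2f}")


def fold_enrichment(mro_reads_a, total_a, mro_reads_b, total_b):
    """Fold change of the MRO read fraction in run A relative to run B."""
    return (mro_reads_a / total_a) / (mro_reads_b / total_b)


# Usage, with MRO counts taken from e.g. `samtools idxstats` (placeholders here):
# fold_enrichment(mro_ribozero, RIBOZERO_TOTAL, mro_polya, POLYA_TOTAL)
</code>

A fold enrichment well above 1 would confirm that the gain in MRO coverage is not explained by sequencing depth alone.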

What is striking is that the mitochondrial rRNA genes still had an enormous amount of coverage. Perhaps next time you want to sequence the mitochondrial transcriptome of some organism, also include probes targeting the mitochondrial rRNA genes!

Another striking thing is that it seems that **RNAseq coverage of mitochondrial ribosomal genes is much lower** than that of the //nad// genes! Exceptions seem to be //rps12// and //rpl16//.

Zooming in to //orf160// / //rpl10//:

{{:orf160_expression_screenshot.png|}}

Unfortunately again it seems we are not seeing any significant evidence of expression of this gene. It may be that the throughput for this gene in particular was not high enough to detect any real expression, so we can't rule it out. Any reads that are overlapping with orf160 may be 3'UTR reads from the //nad7// gene upstream (here annotated as NdhH)
====== Ideas to explore ======

  * Check RNA expression levels
  * Try to sequence mitochondrial RNA or specifically orf160 RNA (RT-PCR plus sequencing - Jacob et al designed primers for orf160 and rps4, but were unsuccessful). As of July 2025, we still have (I think) total RNA extracts from ST7C, E and H in the -80 in the main lab (in one of Greg's boxes).
  * Check for Shine-Dalgarno sequences, possibly after the in-frame STOP codon?
  * Andrew: I guess to know if this is ‘significant’ you’d have to look at the density of these kinds of codons throughout the whole sequence. Since mtDNAs like this are A+T-rich, codons that are ‘close’ to ‘ATG’ might not be so rare.
  * Andrew: One way to test this is to look at the distances from the in-frame stop codon to all ‘near start codons’ in the sequence and add them all up. Then randomly choose the same number of codon positions in that same interval (not allowing the same position to be chosen twice) and calculate the same summed distance. Do the latter step 100 times, and that gives you a distribution of what a uniform placement would look like. If the ‘true’ summed distance is smaller than the random distribution, then it would suggest that ‘near start codons’ are clustered towards the beginning.
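
Andrew's randomization test can be sketched as below. This is a minimal illustration under several assumptions: ‘near start codons’ are taken to be codons exactly 1 base different from ''%%ATG%%'' (as in the annotated ST4 sequence on this page), the function names are made up, and only the first 20 codons of the ST4 DMP/10-212 copy are used, which is far too short an interval for a meaningful result.

<code python>
import random


def is_near_start(codon, start="ATG"):
    """True if `codon` differs from `start` at exactly one base."""
    return sum(a != b for a, b in zip(codon.upper(), start)) == 1


def near_start_positions(codons, stop_pos):
    """1-based positions of near-start codons downstream of the in-frame stop."""
    return [i for i, codon in enumerate(codons, start=1)
            if i > stop_pos and is_near_start(codon)]


def clustering_test(codons, stop_pos, n_perm=100, seed=0):
    """Compare the observed summed distance with randomly placed positions."""
    positions = near_start_positions(codons, stop_pos)
    observed = sum(p - stop_pos for p in positions)
    interval = range(stop_pos + 1, len(codons) + 1)  # codons after the stop
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    null = [sum(p - stop_pos for p in rng.sample(interval, len(positions)))
            for _ in range(n_perm)]
    # Fraction of random draws that are at least as clustered as observed
    frac = sum(s <= observed for s in null) / n_perm
    return observed, null, frac


# First 20 codons of the ST4 copy (KU900236); codon 9 is the TAG site elsewhere
st4_first20 = ("ATG TTA CAA CTG TTA TTG GTA GTT TAT ATA TTG TTT TTG GTG "
               "ATA TTG ATC GAT AAT ATT").split()

observed, null, frac = clustering_test(st4_first20, stop_pos=9)
print(near_start_positions(st4_first20, 9), observed, frac)
</code>

A small fraction would suggest that near-start codons sit closer to the in-frame stop than expected by chance; for a real test the whole gene, and ideally all subtypes, should be used.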

====== Useful data ======

ST4 DMP/10-212 //orf160// sequence (nucleotide, amino acid, codons)

<code>

>KU900236_9694-10176 KU900236
ATGTTACAACTGTTATTGGTAGTTTATATATTGTTTTTGGTGATATTGATCGATAATATT
AAATTAAAATATTTACGTCGTAGTTTTAATTTTAAAAAAATTTCTATGTTTGAGCAATAT
CATTATATTGTAATATGTTCAAACTTACAAATAGTTTCAAATTTAAAACAATTATTACTT
CAATATCCTACAATAAAAATTCAATTTTTTAAAAAATCAAACAGAAATATCTATTTAATT
TTTTTATTACCATATTTAACTAATTCATTAATTTTATTGGGATGTAACGAACTTAATGTT
TTTTTTAAATTGTGTGAATGTGTTTCTAAAAATATTTTGTTTATAAAAGTACAAAATACG
ATTTATTCTATAAATCAATTTATGGATTGTTCTTCAAATCAAATTATGTTTGGACAAACT
TTAAATTCACTTTATTATAATTTGATTAAAGTTTTTTATTCTTTTTCTTTATTACATAAA
TAA

>KU900236_9694-10176 KU900236
MLQLLLVVYILFLVILIDNIKLKYLRRSFNFKKISMFEQYHYIVICSNLQIVSNLKQLLL
QYPTIKIQFFKKSNRNIYLIFLLPYLTNSLILLGCNELNVFFKLCECVSKNILFIKVQNT
IYSINQFMDCSSNQIMFGQTLNSLYYNLIKVFYSFSLLHK*

$$$, &&&, ^^^ and %%% and ### are codons 1 base different from ATG

>KU900236_9694-10176
ATG TTA CAA CTG TTA TTG GTA GTT TAT ATA TTG TTT TTG GTG ATA TTG ATC GAT AAT ATT
 M   L   Q   L   L   L   V   V   Y   I   L   F   L   V   I   L   I   D   N   I
 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
                                 *  $$$ &&&     &&& ^^^ $$$ &&& %%%         ###

AAA TTA AAA TAT TTA CGT CGT AGT TTT AAT TTT AAA AAA ATT TCT ATG TTT GAG CAA TAT
 K   L   K   Y   L   R   R   S   F   N   F   K   K   I   S   M   F   E   Q   Y
                                                            ^^^

CAT TAT ATT GTA ATA TGT TCA AAC TTA CAA ATA GTT TCA AAT TTA AAA CAA TTA TTA CTT
 H   Y   I   V   I   C   S   N   L   Q   I   V   S   N   L   K   Q   L   L   L

CAA TAT CCT ACA ATA AAA ATT CAA TTT TTT AAA AAA TCA AAC AGA AAT ATC TAT TTA ATT
 Q   Y   P   T   I   K   I   Q   F   F   K   K   S   N   R   N   I   Y   L   I

TTT TTA TTA CCA TAT TTA ACT AAT TCA TTA ATT TTA TTG GGA TGT AAC GAA CTT AAT GTT
 F   L   L   P   Y   L   T   N   S   L   I   L   L   G   C   N   E   L   N   V

TTT TTT AAA TTG TGT GAA TGT GTT TCT AAA AAT ATT TTG TTT ATA AAA GTA CAA AAT ACG
 F   F   K   L   C   E   C   V   S   K   N   I   L   F   I   K   V   Q   N   T

ATT TAT TCT ATA AAT CAA TTT ATG GAT TGT TCT TCA AAT CAA ATT ATG TTT GGA CAA ACT
 I   Y   S   I   N   Q   F   M   D   C   S   S   N   Q   I   M   F   G   Q   T

TTA AAT TCA CTT TAT TAT AAT TTG ATT AAA GTT TTT TAT TCT TTT TCT TTA TTA CAT AAA
 L   N   S   L   Y   Y   N   L   I   K   V   F   Y   S   F   S   L   L   H   K

TAA
 *
</code>