Knowledge from the Literature: Jacob et al, 2016, GBE
orf160 is an enigmatic gene found in Blastocystis mitochondrial genomes. It is named as such because, even though it's not technically an ORF (see below), it looks like one, and it is between 158 and 162 amino acids long (~480 nt).
Only in ST4 DMP/02-328 and ST4 DMP/10-212 is this a bona fide ORF. In all other described subtypes, it has one or two in-frame STOP codons. In ST4, position 9 is TGG(Trp) or TAT(Tyr). Position 11 is TTG(Leu). In all other strains, positions 9 and 11 are TAG STOP codons. In ST8 DMP/08-128, position 2 is a TGA STOP codon. Most other STOP codons in the Blastocystis mitochondrial genomes are TAA. No alternative START codons (CTG or TTG) are found after position 9
It has extremely low sequence similarity to anything in the public sequence databases, and no BLAST hits outside of Blastocystis were identified. Even within Blastocystis, BLAST hits only have ~27.5% sequence similarity at the amino acid level. The %GC and amino acid composition are somewhat reminiscent of some ribosomal proteins
It is unlikely to be a pseudogene:
Jacob et al identified a 55 or 56 bp overlap at the 5' end of orf160. This means that the first 19 codons (amino acids) of orf160 overlap with the 3' end of nad7. We checked the overlap with nad7 in the Blastocystis NandII strain.
Jacob et al tried RT-PCR. Roger lab with Eleni Gentekaki also found no EST or RNAseq evidence
If TAG were read as a sense codon rather than a STOP, all mitochondrial genes should end with TAA or TGA, which according to Jacob et al is not always true. We should then also be able to find a tRNA with an anticodon able to recognize TAG. Jacob et al were unable to find any in the ST1 nuclear genome.
Unable to check this hypothesis because no RNA data available
I took the protein sequence of one of the only two proper orf160 copies, from ST4 DMP/10-212 (APC25055.1), and ran it through the online AlphaFold server. It gave me a decent structure, which I then ran through Foldseek. Against AFDB-SwissProt, the top hits were 39S ribosomal protein L10, mitochondrial, among eukaryotes, and 50S ribosomal protein L10 among bacteria. Apparently, RPL10 is either absent or unrecognized in many eukaryotic lineages (Ryo Harada et al 2025, Journal of Eukaryotic Microbiology). In all of those hits, the first 21-25 or so amino acids of the N-terminal region did not align in the Foldseek alignment. The unaligned N-terminal part of the query also seems to stick out of the structure a bit.
I tried Deeploc and SignalP
As a query, I used the publicly available orf160 copy of ST7B from CU914152.1, the complete nucleotide sequence of the ST7B mitochondrial genome. The relevant orf160 sequence (coordinates 13813-14292), with the in-frame stop codon annotated:
>CU914152_13813-14292 CU914152
.  .  .  .  .  .  .  .  TAG
ATGTTACCACTGTTATTGGTAGTTTAGATATTGTTTTCGGTGATATTGATAGATAATTTA
AAAATTGTACGTAAATATAATTATATTTTAAAATTTAATCAATATTTTAAAAATTATAAA
TATATGTTATTTTGTGATAATACTAGTTTAAATTTGAATTTATATAAACATGAAATTTTG
TTAAATCCTAATGTTAAATGTATTTTCTTAAAAAAATTTAAATGTATTGATAATTTAACA
TATTTTAATTCTAATTTAAAAAATTCAACAGTAATTTTTTGTACTAATGATTTACAAACT
TTATATTTAATAATTTCTAAATTACAAACTAATATATTATTTTGTAAAATACAAAATAAT
TATTATTCTTTAAAAAATTTAAATACTTATATTAATTCTATATATGGATTAGTTAATTAT
TTAGATAATTATATGAGTAATTTTATATTTTTATTTCAACAAATTTCTAAAAAACAATAA
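To double-check where an in-frame STOP sits in a copy like this one, a minimal sketch along the following lines could be used. It assumes the standard genetic code (table 1, as used with --gcode 1); the function names `translate` and `in_frame_stops` are my own and not from any tool used here.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(seq):
    """Split a nucleotide sequence into complete codons, starting at base 1."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    """Translate in frame 1; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(c, "X") for c in codons(seq))


def in_frame_stops(seq):
    """Return (1-based codon position, codon) for every STOP before the last codon."""
    return [
        (i, c)
        for i, c in enumerate(codons(seq)[:-1], start=1)
        if CODON_TABLE.get(c) == "*"
    ]


if __name__ == "__main__":
    # First 33 nt of the CU914152 orf160 copy shown above.
    print(in_frame_stops("ATGTTACCACTGTTATTGGTAGTTTAGATATTG"))
```

The terminal codon is deliberately excluded so that only premature STOPs are reported.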
I found that strains B, E (Seq_114_MRO), and G had perfect matches. C, E (Seq_115_MRO) and H had some frameshifts, which introduced many other STOP codons. The frameshifts always occurred in this area TGTATTTTCTTAAAAAAATTT, which is a bit downstream of the in-frame STOP codon
Errors were unsurprisingly in the long homopolymer A region. Copies with exactly 7 A's had no frameshift. The C and H frameshifts could be explained by persistent, unpolished sequencing errors:
Canu sometimes struggles with circular genomes, and the MRO contigs were 1.5x to 2x too big. The ‘superfluous’ regions had, for some reason, poor Illumina sequencing coverage and were therefore poorly polished. Unaware of this (totally understandable), Greg had cut the ‘good’ parts and kept the ‘bad’ parts. orf160 was found in both the ‘good’ and ‘bad’ parts, and Greg had thus kept the ‘bad’ orf160.
After correcting for this by recutting the MRO contigs, the orf160 copies now had perfect matches to the CU914152 copy. The E (Seq_114_MRO) frameshift is probably also a persistent, unpolished sequencing error. However, the superfluous area of Seq_114_MRO did not have an extra orf160 copy, so no ‘good’ copy existed. Yet, if we look at the Illumina coverage of orf160, we can see that the homopolymer area in question was not well covered. Hence, it's likely a remaining sequencing error.
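A quick way to screen copies for this kind of homopolymer error is to measure the A run and compare it with the reference length of 7. This is only a sketch; `longest_run` and `causes_frameshift` are hypothetical helper names, and the check assumes the run length is the only difference from the reference.

```python
from itertools import groupby


def longest_run(seq, base="A"):
    """Length of the longest uninterrupted run of `base` in `seq`."""
    runs = [len(list(group)) for b, group in groupby(seq.upper()) if b == base]
    return max(runs, default=0)


def causes_frameshift(reference_run, observed_run):
    """A run-length difference that is not a multiple of 3 shifts the reading frame."""
    return (observed_run - reference_run) % 3 != 0


if __name__ == "__main__":
    # Region from the CU914152 copy where the frameshifts were seen.
    region = "TGTATTTTCTT" + "A" * 7 + "TTT"
    observed = longest_run(region, "A")
    print(observed, causes_frameshift(7, observed))
```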
The perfect anticodon to TAG would be CTA, or more specifically CUA in RNA terms:
3'-AUC-5' anticodon in tRNA
   |||
5'-UAG-3' codon in mRNA
However, this may also be possible? TTA, or UUA in RNA terms, if the G at the third codon position forms a G-U wobble pair:
3'-AUU-5' anticodon in tRNA
   |||
5'-UAG-3' codon in mRNA
But a UUA anticodon would fit much better with TAA, which is itself a STOP codon. So it is unlikely that there is actually a UUA/TTA tRNA:
3'-AUU-5' anticodon in tRNA
   |||
5'-UAA-3' codon in mRNA
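The pairings above can be checked mechanically. The sketch below is a simplification: it assumes Watson-Crick pairing at codon positions 1 and 2 and allows only G-U wobble at codon position 3, ignoring inosine and other modified bases. The names `anticodon_for` and `can_pair` are my own.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
GU_WOBBLE = {("G", "U"), ("U", "G")}


def anticodon_for(codon):
    """Perfect (Watson-Crick) anticodon for a DNA codon, written 5'->3' as DNA."""
    return "".join(COMPLEMENT[b] for b in reversed(codon.upper()))


def can_pair(anticodon, codon, allow_wobble=True):
    """Can a 5'->3' anticodon read a 5'->3' codon? Accepts DNA or RNA letters."""
    a = anticodon.upper().replace("T", "U")
    c = codon.upper().replace("T", "U")
    # Codon position 1 pairs with anticodon position 3, position 2 with position 2.
    if (a[2], c[0]) not in WATSON_CRICK or (a[1], c[1]) not in WATSON_CRICK:
        return False
    # Codon position 3 pairs with anticodon position 1 (the wobble position).
    third = (a[0], c[2])
    return third in WATSON_CRICK or (allow_wobble and third in GU_WOBBLE)


if __name__ == "__main__":
    print(anticodon_for("TAG"))
    print(can_pair("UUA", "UAG"), can_pair("UUA", "UAA"))
```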
I ran tRNAscan-SE on the nuclear and mitochondrial genomes of ST7C (assembled by Greg and me), but could not find any CUA/CTA or UUA/TTA tRNAs. prokka, which runs aragorn under the hood, also did not find any tRNAs with these anticodons in the ST7C mitochondrial genome. Aragorn is a lot more sensitive (and picks up many false positives as well), so if even aragorn can't find such tRNAs, they probably don't exist.
I ran prokka with --addgenes --addmrna --kingdom Mitochondria --cdsrnaolap --gcode 1 on Greg’s ST7C mitochondrial genome. --kingdom Mitochondria ensures it searches mitochondrial databases for functional annotation. --gcode 1 ensures it uses the standard genetic code; prokka runs prodigal under the hood, which uses Genetic Code 11 (Bacteria, Archaea, Plastids) by default.
It found one gene that ends with TGA: ST7C_00011, and three genes that end with TAG: ST7C_00027 (hypothetical), ST7C_00039 (nadj), ST7C_00057 (nad4).
To assess whether these genes were predicted with the correct end codons, I compared with existing annotations of homologs in ST4 DMP/10-212.
ST7C_00011 may be a false positive gene. It is short (43 amino acids), has only uncharacterized bacterial BLAST hits in public databases (there was 1 Blasto hit, but that may also be a mispredicted gene), and is entirely enveloped (on the opposite strand) by ST7C_00010, most likely the 16S rRNA gene.
ST7C_00027 is a homolog of APC25073.1, ribosomal protein S12. ST7C_00027 is length 146, APC25073.1 is length 125. They have the same start, but ST7C_00027 is about 21 amino acids longer. The APC25073.1 stop codon is TAA (see below). Their nucleotide sequences are very similar around the TAA codon in ST4. ST7C seemingly has an insertion relative to ST4, which means the TAA is skipped and the reading frame continues to the downstream TAG. This could explain why the ST7C copy ends with TAG instead of TAA. The insertion happens in an AAAAAAA stretch, suggesting it could be a homopolymer-type sequencing error. However, upon checking the DNA Illumina mapping, there do not seem to be any sequencing errors in this area in ST7C. It’s also curious that this is the area where a tRNA is predicted to start on the same strand. You would perhaps not expect this overlap.
tRNA ST7C TCT CTA ATA GTT CAA GGG TTA .... TAGAGAG # tRNA Asn (gtt)
G T K R K K S L I V Q G L E H M T V N H V V I G S S P I *
ST7C GGT ACA AAA AGA AAA AAA TCT CTA ATA GTT CAA -GG G TTA GAA CAC ATG ACT GTT AAT CAT GTT GTT ATA GGT TCG AGT CCT ATT TAG
|| | ||| | | ||| | ||| || ||| ||| ||| || i
ST4 GGA ATG AAA AAA AAA AG- TCT TTA ATA GTT CAA AGG T
G M K K K S L * * F K G
TA A
tRNA ST4 TCT TTA ATA GTT CAA AGG TTA .... TAGAGAG # tRNA Asn (gtt)
ST7C_00039 is a homolog of `APC25066.1`, NADH dehydrogenase subunit 9 (nad9). `ST7C_00039` is length 196, `APC25066.1` is also length 196. They have the same start and end. The `APC25066.1` stop codon is `TAA` (see below). There are no insertions or deletions, and no sequencing errors in ST7C at the STOP codon. It looks like a simple mutation causes the difference between the two strains, which suggests the ST7C `TAG` is a true stop codon.
S P K F K S Y Y N Y D N F Y S F *
ST7C TCT CCT AAA TTT AAA TCT TAT TAT AAT TAT GAT AAT TTT TAT TCA TTT TAG
||. ||. ||| ||| ||| ||| ||| ||| | | | || ||| ||| ||| | || ||
ST4 TCA CCA AAA TTT AAA TCT TAT TAT ATG TTT AAT AAT TTT TAT ACT TTA TAA
S P K F K S Y Y M F N N F Y T L *
ST7C_00057 is a homolog of APC25056.1 (nad4). ST7C_00057 is length 73, while APC25056.1 is length 487! `ST7C_00057` matches only the C-terminal part of `APC25056.1`, possibly because `ST7C_00057` is the last gene on the linear representation of the circular MRO genome. Prokka calls prodigal with `-p meta` and `-c`. The `-c` option (closed ends) does not allow genes to run off the edges of a sequence, which probably also means that genes cannot wrap around from the end to the start of the FASTA record. The `APC25056.1` stop codon is `TAA` (see below). `ST7C_00057` and `APC25056.1` end at the same position, suggesting the `TAG` of ST7C is at least in the right spot.
F I I G I Y P T F I L D Y L N M S V S F L L N I V S C *
ST7C TTT ATT ATA GGT ATT TAT CCT ACT TTC ATT TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT ATA GTA TCT TGT TAG
||| | || || | ||| ||| ||| || | ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| ||| || | | | ||| ||
ST4 TTT TTA ATT GGA TTA TAT CCT ACT TTT TTA TTA GAT TAT TTG AAT ATG TCA GTT AGT TTT TTA TTA AAT TTA ATT TGT TGT TAA
F L I G L Y P T F L L D Y L N M S V S F L L N L I C C *
There are broadly speaking two mitochondrial release factors:
- mtRF1, which binds UAA and UAG
- mtRF2, which binds UAA and UGA
Using human, yeast and annotated Blasto copies, and BLASTP vs all ST7 strains, I’ve identified the ST7 Blasto homologs:
- mtRF1: ST7C_HKIIKG_7787_gene (Seq23) (reciprocal BLAST against UniProt returns peptide release factor 1, prfA)
- mtRF2: ST7C_TYYQWW_4460_gene (Seq9) (reciprocal BLAST against UniProt returns peptide release factor 2, prfB)
ST7C_HKIIKG_7787_gene has the glutamine Gln181 residue: GVHRV--Q--RVPETESQGRIHTSTMTVAVL (Q = glutamine). Gln181 is critical for recognizing the G in TAG, and its presence suggests that TAG still functions as a STOP codon (Zihala & Elias, 2019). It also lacks a serine at position 206, which in RF2 is associated with recognizing both A and G at the second codon position (allowing recognition of TAA and TGA). It also has a T, or Thr186, which is responsible for discriminating adenine from guanine at the second codon position. This means it should not recognize TGA.
Some STs also have genes encoding RPS4 that appear to lack an ATG start codon.
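As a small bookkeeping aid, the codon specificities listed for the two release factors can be turned into a lookup, so that for any observed stop codon it is clear which factor would have to terminate it. This is a sketch of the textbook specificities only; `factors_for` is a hypothetical helper.

```python
# Stop codons recognized by each mitochondrial release factor (RNA alphabet).
RELEASE_FACTORS = {
    "mtRF1": {"UAA", "UAG"},
    "mtRF2": {"UAA", "UGA"},
}


def factors_for(stop_codon):
    """Names of the release factors expected to recognize a stop codon (DNA or RNA)."""
    codon = stop_codon.upper().replace("T", "U")
    return sorted(name for name, codons in RELEASE_FACTORS.items() if codon in codons)


if __name__ == "__main__":
    for stop in ("TAA", "TAG", "TGA"):
        print(stop, factors_for(stop))
```

Under these assumptions, any gene that truly ends in TAG depends on mtRF1 alone.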
orf160 sequences within Blastocystis are too divergent to get sensible alignments with mafft. Since Foldseek gave us confident hits to RPL10, we can infer that the structure is fairly well conserved, even if the primary sequence is not. Hence, ideally, we would like to use structural information to align all orf160 homologs. We can do this with FoldMason, another tool from the same group that developed Foldseek.
I ran AlphaFold3 on all Jacob et al orf160 predicted amino acid sequences (replacing ‘*’ with ‘X’) of many different subtypes, and used FoldMason MSA to align these predicted structures with predicted structures of the best Foldseek hits (using ST4 homolog as query).
The first 20-30 or so amino acids of the Blasto homologs do indeed not seem to align to the core RPL10 domain, further suggesting that perhaps this part is not actually translated by Blasto.
After those 20-30 amino acids, there was not a single methionine strictly conserved across all Blasto subtypes. So, if it indeed starts after the TAG STOP codon, and it is indeed not a pseudogene, then it must use an alternative START codon.
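The check for a strictly conserved methionine can be done directly on the aligned sequences. The sketch below assumes the alignment is available as equal-length strings with '-' for gaps; `conserved_columns` is a hypothetical helper, and the toy alignment in the example is made up purely for illustration.

```python
def conserved_columns(alignment, residue="M", start=0):
    """0-based alignment columns (from `start` on) where every sequence has `residue`."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        col
        for col in range(start, width)
        if all(row[col].upper() == residue for row in alignment)
    ]


if __name__ == "__main__":
    # Toy alignment, NOT real orf160 data.
    toy = ["MK-LM", "MRALM", "MKAIL"]
    print(conserved_columns(toy))           # columns where all rows have M
    print(conserved_columns(toy, start=1))  # same, ignoring the first column
```

Setting `start` to the first column after the unaligned N-terminal stretch restricts the search to the region of interest.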
If so, then what is that START codon?
According to Jacob et al 2016 and Perez-Brocal & Clark 2008, the MRO genome encodes three methionine tRNA genes: two elongator tRNAs (Me1 and Me2), which are close to each other on the mitochondrial genome, and one initiator tRNA (Mf), located between the small (12S) and large (16S) subunit rRNA genes.
All of them have the CAU anticodon, which would match with the ATG start codon. So, perhaps there is another initiator tRNA gene, either on the mitochondrial or on the nuclear genome, that could fit with an alternative start in the orf160 gene.
Visual inspection of Blasto MRO genome gene structures (as annotated in the literature) revealed that some areas were unannotated, or unaccounted for. In particular, within a block of tRNA genes there seems to be a gap large enough to fit exactly one additional tRNA gene. There is also such a gap between nad6 and 12S rRNA, and between nad2 and nad11. Perhaps these areas could encode a tRNA that is an alternative initiator.
The literature annotations for all subtypes are derived from Perez-Brocal and Clark, 2008, who sequenced MRO genomes of Blastocystis DMP_02_328 (ST4) and NandII (ST1). tRNA genes for those two strains were predicted with tRNAscan-SE.
tRNAscan-SE is very specific, but can lack sensitivity in mitochondrial genomes, as its covariance models are built from bacterial, archaeal, eukaryotic nuclear, or metazoan/mammalian mitochondrial tRNAs. Aragorn, on the other hand, is highly sensitive but not very specific, and also seems to be better at detecting mitochondrial-type tRNAs.
I applied aragorn in mitochondrial mode to all Jacob et al blasto MRO genomes:
aragorn -mt -mtd -c -d -e -rp -br -wa -o $base.aragorn.out $fasta
aragorn -mt -mtd -c -d -e -rp -o $base.aragorn.full.out $fasta
I compared with pre-existing annotations. It found loads of false positives, but there were a few interesting hits:
1. A tRNA gene in all 3 ST3 strains between nad6 and 12S rRNA gene, with anticodon `aat` (codon `ATT`):
t
t-a
t-a
t.t
a-t
g+t
t.t
t-a
t-a a
t tatat a
ataa a !!!!! a
t aaat atata t
a !!!! t tt
t ttta t
agat a a
a-tga
g-c
a-t
g+t
a-t
t t
t g
aat
mtRNA-Ile(aat)
77 bases, %GC = 9.1
Sequence [1139,1215]
Score = 102.678
2. A tRNA in ST6, with aat as anticodon, in between Asn and Leu tRNAs in a tRNA block:
t
t-a
a-t
a-t
a-t
a-t
t.t
at.t gaa
t tgattt a
a a :!!!!! t
t aac tctaaa t
a !!! a ttat
ttg a
a t g
a-tga
a-t
a-t
t.t
a a
t a
c g
aat
Possible Pseudogene
mtRNA-Ile(aat)
73 bases, %GC = 12.3
Sequence [28432,28504]
Score = 94.5612
So both possibly new tRNAs have the anticodon aat, which recognizes the codon ATT. The original annotation already had an Ile tRNA with anticodon gat, matching codon ATC (and able to read ATT through G-U wobble). So there is perhaps some redundancy here, and perhaps these new hits are pseudogenes? In theory, aat can base pair only with codon ATT.
The Jacob et al Blasto MRO genomes seem to encode tRNAs for only 14 amino acids, and are lacking tRNAs for Ser, Val, Thr, Gln, Arg and Gly. Perez-Brocal & Clark 2008 had already observed this paucity of tRNA genes and called this “the most dramatic case of tRNA gene loss observed within the stramenopiles”. The mtDNA-encoded proteins do include these amino acids, so their tRNAs must be encoded on the nuclear genome and imported into the mitochondrion. Maybe the alternative initiator tRNA is also imported?
RPS4 also lacks a typical start codon in ST1, ST2, ST4 and ST8 (Jacob et al 2016). If they use an alternative START, it may be the same START that orf160 is using. RPS4 in Blastocystis seems curiously also twice the size of that in Proteromonas lacertae.
I collected RPS4 homologs from the Jacob et al MRO genomes and Proteromonas lacertae, did an AF3 + FoldMason alignment, and then back-translated it into a codon alignment.
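The back-translation step works roughly as sketched below: each aligned residue is replaced by the next codon of the corresponding CDS, and each gap by three gap characters. This is a minimal sketch, not the exact procedure used here; `protein_aln_to_codon_aln` is a hypothetical name, and it assumes the CDS is in frame and matches the ungapped protein one codon per residue (including any 'X' placeholders for in-frame stops).

```python
def protein_aln_to_codon_aln(aligned_protein, cds, gap="-"):
    """Thread a CDS onto one row of a protein alignment, codon by codon."""
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if n_residues * 3 != len(cds):
        raise ValueError(
            f"CDS length {len(cds)} does not match {n_residues} aligned residues"
        )
    out = []
    pos = 0  # current position in the CDS
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


if __name__ == "__main__":
    # Toy example, NOT real RPS4 data.
    print(protein_aln_to_codon_aln("M-K", "ATGAAA"))
```

Applying this to every row of the protein alignment gives a codon alignment with the same column structure.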
The AF3 predicted structures, even between closely related Blasto subtypes, do not seem to align very well in 3D.
RPS4 of Blasto is indeed about 2x the size of that of Proteromonas. The Proteromonas copy starts with ATG. FoldMason aligned the Proteromonas copy with the N-terminus of the Blastocystis copies. However, it seemingly only aligns well (in the sequence alignment) with the Blasto copies in a particular section: the section that yields good Foldseek hits with public RPS4 homologs in the databases.
I compared the RPS4 codon alignment with that of the RPL10 / orf160 alignment. Codons that seem to be conserved (but not perfectly), in the beginning area of both RPL10 and RPS4 are AAA, AAT.
However, since the sequences don't even align that well, I'm not even sure where the true RPS4 gene starts and ends in Blastocystis and Proteromonas mtDNA.
To see whether the mysterious orf160 is actually expressed, I inspected the regular RNAseq, that is, mRNA sequencing via oligo-dT primers/probes, by mapping it on the ST7C genome, that included the MRO genome.
Here is the visual overview:
You can see some RNAseq reads overlapping with ORF160 on its 5' end, and in the proper direction too. However, these may be reads coming from the immediately upstream gene nad7 (in the IGV figure wrongly annotated as ndhH)
Since a regular, polyA-capturing RNAseq by its nature does not capture mitochondrial transcripts, which are not believed to be polyadenylated, this lack of expression may simply reflect a lack of captured mitochondrial transcripts in the experimental design.
I therefore tried to resequence the same RNA sample using a different kit, the riboZeroPlus kit. This kit uses a set of custom designed probes to remove rRNA transcripts from the total RNA extract prior to library prep and sequencing.
I used the following probes:
- 18S rRNA probes (one per line): TTTCATAAACAAACCAAAAAATCGACTATGAAAGCCAATCTTATTATTCC CAAACACTTTCAATAAATTATCTAAACTTCAACTACGAGCTTTTTAACTG TTATCCATATAGAAACTATTCCAAATAAACTATAACTGATATAATGAGCC CTAACAAGCATGCGATAAAGTCAACAATTATTATTACTCACAATTCAATT TAGCTTTCGTTCTTGATTAATGAAAACATCCTTGGTAAATGCTTTCGCAC CAGATACTCGTTGAATAGTTCAGTGTCGCGCGCGTGCAGCCCAGAACATC CTAAAACTATTTAGACTTACACATGCATGGCTTAATCTTTGAGACGAGCG CCATGGTAGTCCAATACACTACCATCGAAAGCTGATAGGGCAGAAACTTG GAAAAATTACAAGCATCAATCCCCATCACGAACTATTTTCAAAAGATTTC AAATCATAGAATTTCACCTCTAGCTATTGAATATGAATACCCCCAACTGT TCACCTTCCTCTAGATGATAAGATTTACACGACTTCTCTTCAACTATCTA ATAAGTACTTCTTTAATGGTTGCCCATCAAAGAAAACACATGTATTAGCC ACTAACTCCTAGTCGGTATCGTTTATAGCTAAGACTACGAGGGTATCTAA CTATCAATCTGTCAATCCTTCCTATGTCTGGACCTGGTAAGTTTCCCCGT TCCTTGCGGAACCATGGCACCCACCTGGATGTCGATAACTTACATAAAAG GATTTATTGTCACTACCTCCCTGTGTCAGGATTGGGTAATTTACGCGCCT ATAATTAAAAATCCAAAGTGTTCACCGGATCATCCAATCGGTAGGTGCGA AAGGGCAGGGACGTAATCAACGCAAGTTGATGACTTGCATTTACTAGGAA CCTGTTATTGCTTCCAGCTTCCCCGTACTCAAACGCACAGTGTCCCTCTA ACAATGGGGCATTACTAAAATCCCATTTCATCCAACTAATAGGCGGAAGT AACTGAACAGTCCGCTTTAAACACTCTAATTTTCTCACAGTAAATGACCA TGTGGTAGCCATCTCTCAGGCTCCCTCTCCGAAATCGAACCCAAATTCTT ACTCCCCCCGGAACCCAAAGACTTTGATTTCTCATAAGGTACTAATAGAC TTGTTTATCGATAACGATTGTACATTGTTCTCAATTCAATTACAAAACCA - 28S rRNA probes (one per line) CTAACAATGTCTCCCACGTGGGTTGCAACTCGAGAGAGAAGCTTACACAT AGCCTTTGATGGAGTTTACCACCAACTTCGAGCTGCAATCCCAAACAACT AAGCCATCACCCCATATTATGGAATAAGTAAAACAACATTAGAGGTAGTG TCCATGCATCATTCAACCACTCCTACGCTTAACCCCTCCACGATTTCAAG ATTCAAAATATTGAATTCCTTTACCAATAACAAAACCTTTTCGCGGATTC GTCGTCTACAAAGGATCTTTGTTCATTGACCATTAAAAATGCTATCAGGG AGTCCAGCTTACCCGGAATGGCCCACTAGCAACTACTATTCAAAATTACA AGGCTGTTCGCTTAAGCGCCATCCATTTTCAGGGCTACTTCATTCGGCAG TTTTCAAAGTGCTTTTCATCTTTCCCTCACGGTACTTGTTCGCTATCGGT AGCACTGGGCAGAATTCACATTGTGTCAATATATCTTTCACACTATCACA TTTATCAGAGATGCAAGACCGGTAGTTGTTGCTAGCTCTCTTTAGACAAA TTTTCTATCCAACTGAGCGAACAATTAGGCGCCGTACCATATCGTTCGGT AGGTTGACAAATTGCAGAAATAGTTAATAGGGCCGTCCACCTCCCCAGGG 
GTTTCAAGACGGGACGGAGAAGCAGTTATTAGGAAAGAGGAAATTCAGTA AAGCAACTATAATATCTTACCCATTCAAAGTTTGAGAATAGGTCCAGGAT AAATGTGTTCCCAAAGGGAGGGAAATAATATTACTTTTCAAGGACCCATT AAGCCGTATCTACTCAAATAGGCTTCTTTATATAGGTCACATCCTTTGGT CTGCTTCACAAGTACAATACACTATGCAAATACAGGGTTTTCACCTTCTA GCTACTTCCACCAAGATCTGCACTAATGGACATTCCATATAAGTTTACAC CATTATTCTATTAACTAGAGGCTATTCACCTTGGAGACCTGATGCGGTTA GAGAAGAGGTAATAAGGGAAAGGGAATTAATTGATATTTACCAATTTAAC TACATATTTTAGGAGGGCTTCATGATTAGAGGCTTTCATCACTACGACCC CGTTCAAAGATTCAATGACTCACAGACTTCTGCAGTTCGCATTACGTATC TCTCACATTTTACCCAGTCTGCAAGGTATTGGTAGGAAGAGCCGACATCG AGTTCAACACGATTCCTATGGAACCTTTCTCCACTTCAGTCTTCAAAGAT CGAGAACCACTGTATTCATATCACTAACCTAGTCAATTGAACTGTTGTCG TAGTAGACAGACATCCAAGTCAAATCACACTCCAACAAGCATACTCCCAA AGAGAGTCATAGTTACTCCCGCCGTTTACCCGCGCTTGGTTGAATTCCTT CATCAATCATCTCATTCATTTGATAACCAAGAACTGACGATCCTATCATT TCTGTTACCATTCAATTCCATTTCATTGGTTCAGGAATATTAACCTGATT ACCTTCATTACGCATTTTAGTTTAACACTAAACTACTCGCAAATATGATA GTTCTAAAAATTCAAAAGAACTTTTTCAACGGATTTCACCTATCTCTTAG TTTTCCTCTGCTTAGTTAGATGCTTCAATTCAGCAGGTCTTCTTGCTTGA ATCCAATTCTCATAGTATACTGTTACTAAACAATACTTCTACACTCCACA CCTAGCCCTCAGAGCCAATCCTTATCCCGAAGTTACGGATCTAATTTGCC ATTCTATTTCAATGGAGGAAACTCTTAGTCAATCCACCATCAATCATCGT TTCGTCCTATTCAGGCATAGTTCACCATCTTTCGGGTCCCACCATCTTTG CCCTTAAAAAGAGTCTCCCACCTATTCTACACCCTCTAAGTCATTTCACA CATACTGAAAATCAAAATCAAATGAGCTTTTACCCTTTTATTCTACGTAA TGAGCTCATCTTAGGACACCTGTGTTATTCTTTAACAGATGTGCCGCCCC GATAAGTCTCAATTTCTCGTTGAACTAAGTCAACTCGAAAACTTACAACC CCTCTAATCATTCGCTTTACCTCATAAAACTAGACACAGTTGCAGCTATC GTGTTAATTCGGATTGGGCTTTTCCCACTTCACTCGCCGTTACTAAGGGA TCCATCACGCCTTCCTACTTGTCACCCCATAATATAACCATCTACTTGAG CTAGCTTTAAACTCGAAATTCAAATATCTAAAGGATCGATAGGCCATATT TAAACAGTCGGATTCCCCTTGTCCGTACCAGTTCTGAGTCAGCTATTCAT CCCAAATTTAAAGATCAATTTGCACGTTAGAATCCACTCGAACCTCCACC TTTATTATTGTTAACAAGAAAAGAAAACTCTTCCCAGGAGAGTAACCGAT TACCACCACTAAACAACCACTCCTTTGCATACATTCTTATCATCACAAAC CAAGCTCAACAGGGTCTTCTTTCCCCGCTGATATTTCCAAGCCCATTCCC
Sequencing was done at the Genomics CORE Lab in the LSRI, with Mat as contact person.
The sequencing run was excellent. Got a lot of data, and it was also really good quality.
After quality trimming the data, I mapped it to the latest version of the ST7C genome with HISAT2.
A lot of reads still mapped to the rRNA genes, but all the other areas still had more than sufficient coverage as well.
Importantly, the mitochondrial genome had many more reads mapping to it:
This is not just the result of a higher throughput. The throughput of the riboZeroPlus run (179 506 076 mapped reads) was about twice that of the original RNAseq run (85 875 153 reads), but far more than twice the number of reads now mapped to the MRO genome. (NOTE that for both IGV figures I used the same coverage scale of 2000 in the Coverage track.)
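To put a number on this, the MRO read counts can be normalized by the total mapped reads of each run. The sketch below only uses the two totals given above; the per-run MRO read counts are not recorded in these notes, so they are left as function arguments, and `fold_enrichment` is a hypothetical helper.

```python
RIBOZERO_TOTAL = 179_506_076  # mapped reads, riboZeroPlus run
POLYA_TOTAL = 85_875_153      # reads, original RNAseq run


def depth_ratio(total_a, total_b):
    """How many times deeper run A is than run B."""
    return total_a / total_b


def fold_enrichment(mro_a, total_a, mro_b, total_b):
    """Ratio of the MRO read fractions of run A and run B (depth-normalized)."""
    if mro_b == 0 or total_a == 0:
        raise ValueError("cannot normalize with zero reads")
    return (mro_a * total_b) / (mro_b * total_a)


if __name__ == "__main__":
    print(round(depth_ratio(RIBOZERO_TOTAL, POLYA_TOTAL), 2))
    # fold_enrichment(mro_ribozero, RIBOZERO_TOTAL, mro_polya, POLYA_TOTAL)
    # would give the enrichment once the MRO read counts are filled in.
```

A fold enrichment well above 1 would confirm that the increase is not explained by sequencing depth alone.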
What is striking is that the mitochondrial rRNA genes still had an enormous amount of coverage. Perhaps next time you want to sequence the mitochondrial transcriptome of some organism, also include probes targeting the mitochondrial rRNA genes!
Another striking thing is that it seems that RNAseq coverage of mitochondrial ribosomal genes is much lower than that of the nad genes! Exceptions seem to be rps12 and rpl16.
Zooming in to orf160 / rpl10:
Unfortunately again it seems we are not seeing any significant evidence of expression of this gene. It may be that the throughput for this gene in particular was not high enough to detect any real expression, so we can't rule it out. Any reads that are overlapping with orf160 may be 3'UTR reads from the nad7 gene upstream (here annotated as NdhH)
ST4 DMP/10-212 orf160 sequence (nucleotide, amino acid, codons)
>KU900236_9694-10176 KU900236
ATGTTACAACTGTTATTGGTAGTTTATATATTGTTTTTGGTGATATTGATCGATAATATT
AAATTAAAATATTTACGTCGTAGTTTTAATTTTAAAAAAATTTCTATGTTTGAGCAATAT
CATTATATTGTAATATGTTCAAACTTACAAATAGTTTCAAATTTAAAACAATTATTACTT
CAATATCCTACAATAAAAATTCAATTTTTTAAAAAATCAAACAGAAATATCTATTTAATT
TTTTTATTACCATATTTAACTAATTCATTAATTTTATTGGGATGTAACGAACTTAATGTT
TTTTTTAAATTGTGTGAATGTGTTTCTAAAAATATTTTGTTTATAAAAGTACAAAATACG
ATTTATTCTATAAATCAATTTATGGATTGTTCTTCAAATCAAATTATGTTTGGACAAACT
TTAAATTCACTTTATTATAATTTGATTAAAGTTTTTTATTCTTTTTCTTTATTACATAAA
TAA
>KU900236_9694-10176 KU900236
MLQLLLVVYILFLVILIDNIKLKYLRRSFNFKKISMFEQYHYIVICSNLQIVSNLKQLLL
QYPTIKIQFFKKSNRNIYLIFLLPYLTNSLILLGCNELNVFFKLCECVSKNILFIKVQNT
IYSINQFMDCSSNQIMFGQTLNSLYYNLIKVFYSFSLLHK*
$$$, &&&, ^^^ and %%% and ### are codons 1 base different from ATG
>KU900236_9694-10176
ATG TTA CAA CTG TTA TTG GTA GTT TAT ATA TTG TTT TTG GTG ATA TTG ATC GAT AAT ATT
M L Q L L L V V Y I L F L V I L I D N I
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
* $$$ &&& &&& ^^^ $$$ &&& %%% ###
AAA TTA AAA TAT TTA CGT CGT AGT TTT AAT TTT AAA AAA ATT TCT ATG TTT GAG CAA TAT
K L K Y L R R S F N F K K I S M F E Q Y
^^^
CAT TAT ATT GTA ATA TGT TCA AAC TTA CAA ATA GTT TCA AAT TTA AAA CAA TTA TTA CTT
H Y I V I C S N L Q I V S N L K Q L L L
CAA TAT CCT ACA ATA AAA ATT CAA TTT TTT AAA AAA TCA AAC AGA AAT ATC TAT TTA ATT
Q Y P T I K I Q F F K K S N R N I Y L I
TTT TTA TTA CCA TAT TTA ACT AAT TCA TTA ATT TTA TTG GGA TGT AAC GAA CTT AAT GTT
F L L P Y L T N S L I L L G C N E L N V
TTT TTT AAA TTG TGT GAA TGT GTT TCT AAA AAT ATT TTG TTT ATA AAA GTA CAA AAT ACG
F F K L C E C V S K N I L F I K V Q N T
ATT TAT TCT ATA AAT CAA TTT ATG GAT TGT TCT TCA AAT CAA ATT ATG TTT GGA CAA ACT
I Y S I N Q F M D C S S N Q I M F G Q T
TTA AAT TCA CTT TAT TAT AAT TTG ATT AAA GTT TTT TAT TCT TTT TCT TTA TTA CAT AAA
L N S L Y Y N L I K V F Y S F S L L H K
TAA
*
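The candidate alternative STARTs, that is, codons one base different from ATG, can also be listed programmatically. This is a sketch with hypothetical helper names (`one_base_neighbors`, `near_start_positions`); it reports every codon at Hamming distance 1 from ATG, regardless of whether that codon is known to act as a START anywhere.

```python
def one_base_neighbors(codon="ATG"):
    """All codons that differ from `codon` at exactly one position."""
    codon = codon.upper()
    return sorted(
        codon[:i] + base + codon[i + 1:]
        for i in range(3)
        for base in "ACGT"
        if base != codon[i]
    )


def near_start_positions(cds, codon="ATG"):
    """(1-based codon position, codon) for in-frame codons one base away from `codon`."""
    neighbors = set(one_base_neighbors(codon))
    cds = cds.upper()
    return [
        (i // 3 + 1, cds[i:i + 3])
        for i in range(0, len(cds) - len(cds) % 3, 3)
        if cds[i:i + 3] in neighbors
    ]


if __name__ == "__main__":
    # First 20 codons of the ST4 DMP/10-212 orf160 copy (KU900236_9694-10176).
    first_20 = "".join(
        "ATG TTA CAA CTG TTA TTG GTA GTT TAT ATA "
        "TTG TTT TTG GTG ATA TTG ATC GAT AAT ATT".split()
    )
    for position, codon in near_start_positions(first_20):
        print(position, codon)
```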