===== Evaluating and comparing transcriptome assemblies with rnaQUAST =====
Documentation by Joran Martijn (11 January 2021)
Once you have assembled your RNA-seq reads into transcriptomes, you'll want to have some sense of assembly quality, or just a summary of your transcriptome in general (number of transcripts, number of predicted genes, estimated completeness, longest transcript etc.)
A common tool to evaluate genome assemblies is [[https://github.com/ablab/quast|QUAST]]. Thankfully the fine folks who created QUAST also created a special tool to evaluate transcriptome assemblies: [[https://github.com/ablab/rnaquast|rnaQUAST]]. This is the tool that we will be using.
Here is an example perun submission script:
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -m bea
#$ -pe threaded 8
#$ -q 144G-batch,256G-batch,768G-batch
source activate rnaquast-2.2.0
rnaQUAST.py \
-o 6_rnaquast-2.2.0 \
-c 4_trinity-2.11-with-work-around/Trinity.fasta 5_rnaspades-3.14.1/transcripts.fasta \
-t 8 \
-l trinity rnaspades \
-ss \
--gene_mark \
--busco /home/jmartijn/db/busco/eukaryota_odb10
The script activates the rnaquast-2.2.0 conda environment. 2.2.0 is the latest version as of January 2021. This environment contains all the necessary software to run the job smoothly.
Quick explanation on the options:
-o output_directory
-c transcriptome1.fasta transcriptome2.fasta ... transcriptomeN.fasta
-t number_of_threads
-l transcriptome_label1 transcriptome_label2 ... transcriptome_labelN
-ss invoke if you used a strand specific mRNA library
--gene_mark to activate gene prediction with GeneMark
--busco to activate completeness evaluation with BUSCO
The labels are used to rename your transcriptomes in the context of the rnaQUAST analysis. ''transcriptome1.fasta'' will be renamed to ''transcriptome_label1'', etc etc.
Note that the submission script is specifically asking to run on ''144G-batch'', ''256G-batch'' or ''768G-batch'' nodes. For some unclear reason only these nodes will run the software smoothly without fail. It may work on some specific ''16G-batch'' nodes as well but its a gamble.
In this case I ran rnaQUAST without reference genomes, but it is possible to do so. That should give you a lot more interesting metrics to look at.
=== Output ===
Here is the output that I got as an example. As you can see I got about twice as many transcripts with rnaspades compared to trinity, but trinity yielded in general longer transcripts and more predicted genes. Completeness estimates are about the same.
''short_report.txt''
SHORT SUMMARY REPORT
METRICS/TRANSCRIPTS trinity rnaspades
== BASIC TRANSCRIPTS METRICS ==
Transcripts 16142 33100
Transcripts > 500 bp 7208 4977
Transcripts > 1000 bp 4584 3447
== BUSCO METRICS ==
Complete 39.216 37.647
Partial 8.235 8.627
== GeneMarkS-T METRICS ==
Predicted genes 6936 5098
rnaQUAST will also output a PDF with a nice Cumulative transcript / isoform length plot.