cgeb2001's DokuWiki!

This is an old revision of the document!

Evaluating and comparing transcriptome assemblies with rnaQUAST

Once you have assembled your RNA-seq reads into transcriptomes, you'll want to have some sense of assembly quality, or just a summary of your transcriptome in general (number of transcripts, number of predicted genes, estimated completeness, longest transcript etc.)

A common tool to evaluate genome assemblies is QUAST. Thankfully the fine folks who created QUAST also created a special tool to evaluate transcriptome assemblies: rnaQUAST. This is the tool that we will be using.

Here is an example perun submission script:

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -m bea
#$ -pe threaded 8
#$ -q 144G-batch,256G-batch,768G-batch

source activate rnaquast-2.2.0

rnaQUAST.py \
    -o 6_rnaquast-2.2.0 \
    -c 4_trinity-2.11-with-work-around/Trinity.fasta 5_rnaspades-3.14.1/transcripts.fasta \
    -t 8 \
    -l trinity rnaspades \
    -ss \
    --gene_mark \
    --busco /home/jmartijn/db/busco/eukaryota_odb10

The script activates the rnaquast-2.2.0 conda environment. 2.2.0 is the latest version as of January 2021. This environment contains all the necessary software to run the job smoothly.

Quick explanation on the options:

-o            output_directory
-c            transcriptome1.fasta transcriptome2.fasta ... transcriptomeN.fasta
-t            number_of_threads
-l            transcriptome_label1 transcriptome_label2 ... transcriptome_labelN
-ss           invoke if you used a strand specific mRNA library
--gene_mark   to activate gene prediction with GeneMark
--busco       to activate completeness evaluation with BUSCO

The labels are used to rename your transcriptomes in the context of the rnaQUAST analysis. transcriptome1.fasta will be renamed to transcriptome_label1, etc etc.

Note that the submission script is specifically asking to run on 144G-batch, 256G-batch or 768G-batch nodes. For some unclear reason only these nodes will run the software smoothly without fail. It may work on some specific 16G-batch nodes as well but its a gamble.

In this case I ran rnaQUAST without reference genomes, but it is possible to do so. That should give you a lot more interesting metrics to look at.

Output

Here is the output that I got as an example. As you can see I got about twice as many transcripts with rnaspades compared to trinity, but trinity yielded in general longer transcripts and more predicted genes. Completeness estimates are about the same.

short_report.txt

SHORT SUMMARY REPORT

METRICS/TRANSCRIPTS                                    trinity                  rnaspades

 == BASIC TRANSCRIPTS METRICS ==
Transcripts                                            16142                    33100
Transcripts > 500 bp                                   7208                     4977
Transcripts > 1000 bp                                  4584                     3447

 == BUSCO METRICS ==
Complete                                               39.216                   37.647
Partial                                                8.235                    8.627

 == GeneMarkS-T METRICS ==
Predicted genes                                        6936                     5098

rnaQUAST will also output a PDF with a nice Cumulative transcript / isoform length plot.