This is an old revision of the document!
Table of Contents
Gene prediction with the Funannotate pipeline
Joran Martijn (December 2022)
Funannotate is a genome prediction, annotation, and comparison software package. It was originally written to annotate fungal genomes (small eukaryotes ~ 30 Mb genomes), but has evolved over time to accommodate larger genomes.
In my experience it seems to do quite a lot better in predicting gene models than the Braker2 pipeline with Ergobibamus cyprionides . In addition to gene prediction, it can also facilitate functional annotation (hence the name FUNctional - ANNOTATE)
An additional advantage is that it has the capacity to prepare all the files necessary for a NCBI GenBank submission.
Official documentation can be found in their ReadTheDocs
Clean, sort, mask and train
If you have a genome assembly in plain FASTA format ready, as well as some RNAseq data, you can follow along with the Funannotate tutorial described here.
Here I've adapted those commands so they work with our cluster Perun. Note that all these code snippets below are also represented in the Gospel Of Andrew
Clean
The first step is funannotate clean. This algorithm attempts to find and delete short contigs which are 'repetitive', that is they are already fully represented in a larger contig (≥ 95% sequence similarity and ≥ 95% sequence coverage overlap).
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1
source activate funannotate
# input
GENOME='ergo_cyp_genome.fasta'
# run funannotate
funannotate clean \
--input $GENOME \
--out ${GENOME/fasta/clean.fasta}
Sort
The second step is funannotate sort. It will sort your contigs from longest-to-shortest within the FASTA file and rename the contig headers. Apparently NCBI limits the number of characters for FASTA headers to 16, and AUGUSTUS can also have issues with longer contig names. If you do not wish to rename your sequences, you can alternatively pass –simplify or -s, which simply splits headers at the first space. (So if your headers do not have a space to begin with, they will remain the same).
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1
source activate funannotate
# input
GENOME='ergo_cyp_genome.fasta'
## sort by length
funannotate sort \
--input ${GENOME/fasta/clean.fasta} \
--out ${GENOME/fasta/sort.fasta} \
--simplify
Mask
The third step is to mask your genome assembly. This is analogous to the RepeatMasking step of the Braker2 pipeline. The funannotate mask command default is to run simple masking using tantan. The script is a wrapper for RepeatModeler and RepeatMasker, however you can use any external program to softmask your assembly. Softmasking is where repeats are represented by lowercase letters and all non-repetitive regions are uppercase letters.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 10
source activate funannotate
# input
GENOME='ergo_cyp_genome.fasta'
THREADS=10
# soft-mask with tantan
funannotate mask \
--input ${GENOME/fasta/sort.fasta} \
--out ${GENOME/fasta/mask.fasta} \
--method tantan \
--cpus $THREADS
Train
In this funannotate train step, you will map your RNAseq data against the sorted, cleaned and masked assembly with Hisat2, do a genome-guided transcriptome assembly with Trinity and another transcriptome assembly with PASA, all in a single step:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -q 768G
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40
source activate funannotate
# input
GENOME='ergo_cyp_genome.fasta'
RNASEQ_PRD_FW='eb_rna_fw.prd.fastq.gz'
RNASEQ_PRD_RV='eb_rna_rv.prd.fastq.gz'
GENUS='Ergobibamus'
SPECIES='cyprinoides'
STRAIN='CL'
THREADS=40
## align rnaseq data, run trinity and pasa
funannotate train \
--input ${GENOME/fasta/mask.fasta} \
--out funannotate_out \
--left $RNASEQ_PRD_FW \
--right $RNASEQ_PRD_RV \
--stranded RF --jaccard_clip \
--max_intronlen 3000 \
--species "$GENUS $SPECIES" \
--strain $STRAIN \
--cpus $THREADS \
--no_normalize_reads \
--no_trimmomatic \
--memory 100G
This step requires quite a bit of memory, and hence we're asking for the 768G RAM node of Perun. In the above snippet we assume that the RNAseq reads have already been treated with Trimmomatic. If they have not, and they are still raw reads, funannotate should be able to do the trimmomatic cleaning for you (though I have not tested this). We are also running with –jaccard_clip , which is apparently used when you are expecting a high density genome (like that of yeast, for example)
NOTE also that this step generates the `funannotate_out` output directory, which can be used as an input argument in future funannotate jobs.
Predict
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40
source activate funannotate
# input
GENOME='ergo_cyp_genome.fasta'
GENUS='Ergobibamus'
SPECIES='cyprinoides'
STRAIN='CL'
THREADS=40
export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/
# run funannotate
funannotate predict \
--input ${GENOME/fasta/mask.fasta} \
--out funannotate_out \
--species "$GENUS $SPECIES" \
--strain $STRAIN \
--cpus $THREADS
Many of the required inputs do not have to be explicitly specified, since they have been generated in the previous funannotate train step and they are available in the funannotate_out directory.
Funannotate (according to log files) uses AUGUSTUS, CodingQuarry, GeneMark-ES, GlimmerHMM and SNAP as main gene model predictors. Note that this is more than used by Braker2, which mainly uses AUGUSTUS and GeneMark. This may be a reason why the Funannotate prediction pipeline seems to do better than Braker2 for Ergobibamus.
Each program uses a different source for training its algorithms. Each of these have been generated in the previous step.
| Algorithm | Training-Method |
|---|---|
| AUGUSTUS | PASA transcript-to-genome alignments |
| CodingQuarry | RNAseq-vs-genome sorted BAM file and/or StringTie alignments |
| GeneMark-ES | self-training |
| GlimmerHMM | PASA transcript-to-genome alignments |
| SNAP | PASA transcript-to-genome alignments |
As with Braker2 pipeline, funannotate uses the EVidenceModeler (EVM) to compile the best overall gene models from the above set of gene predictions. For Ergobibamus the weights looked something like this:
Source Weight Count Augustus 1 4462 Augustus HiQ 2 660 CodingQuarry 2 6325 GeneMark 1 5433 GlimmerHMM 1 5830 pasa 6 6470 snap 1 3924 Total - 33104
In some of the final steps, funannotate calls upon tRNAscan (I'm guessing the SE version) to predict tRNA genes. Then at the very end, if you have specified it to, it can create GenBank format-compliant files for you to submit to GenBank. Awesome!
