Gene prediction with the Funannotate pipeline
Joran Martijn (December 2022)
Funannotate is a genome prediction, annotation, and comparison software package. It was originally written to annotate fungal genomes (small eukaryotes ~ 30 Mb genomes), but has evolved over time to accommodate larger genomes.
In my experience, it predicts gene models for Ergobibamus cyprinoides quite a lot better than the Braker2 pipeline does. In addition to gene prediction, it can also perform functional annotation (hence the name: FUNctional ANNOTATE).
An additional advantage is that it can prepare all the files necessary for an NCBI GenBank submission.
Official documentation can be found in their ReadTheDocs.
Clean, sort, mask and train
If you have a genome assembly in plain FASTA format ready, as well as some RNAseq data, you can follow along with the Funannotate tutorial described here.
Here I've adapted those commands so they work with our cluster Perun. Note that all the code snippets below are also available in the Gospel Of Andrew.
Clean
The first step is funannotate clean. It attempts to find and remove short contigs that are redundant, that is, contigs that are already fully represented in a larger contig (≥ 95% sequence identity and ≥ 95% coverage).
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1
source activate funannotate
# input assembly
GENOME='ergo_cyp_genome.fasta'
# remove redundant short contigs; writes ergo_cyp_genome.clean.fasta
funannotate clean \
--input "$GENOME" \
--out "${GENOME%.fasta}.clean.fasta"
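To see how much funannotate clean actually removed, it can help to compare the number of contigs before and after. The helper functions below are a minimal sketch of my own (count_contigs and contigs_removed are hypothetical names, not part of funannotate); they simply count FASTA header lines.

```shell
#!/bin/bash

# Count the sequences in a FASTA file by counting
# the header lines (lines that start with '>').
count_contigs() {
    grep -c '^>' "$1"
}

# Report how many contigs were removed by a cleaning step.
# Arguments: original FASTA, cleaned FASTA.
contigs_removed() {
    local before after
    before=$(count_contigs "$1")
    after=$(count_contigs "$2")
    echo $(( before - after ))
}
```

Once the job has finished, something like `contigs_removed ergo_cyp_genome.fasta ergo_cyp_genome.clean.fasta` should print the number of contigs that were dropped.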
Sort
The second step is funannotate sort. It sorts the contigs in the FASTA file from longest to shortest and renames the contig headers. Apparently NCBI limits FASTA headers to 16 characters, and AUGUSTUS can also have issues with longer contig names. If you do not wish to rename your sequences, you can pass --simplify (or -s) instead, which simply splits the headers at the first space. (So if your headers do not contain a space to begin with, they will remain the same.)
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1
source activate funannotate
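# input assembly (original file name; the cleaned file name is derived from it)
GENOME='ergo_cyp_genome.fasta'
# sort contigs by length; --simplify splits headers at the first space instead of renaming
funannotate sort \
--input "${GENOME%.fasta}.clean.fasta" \
--out "${GENOME%.fasta}.sort.fasta" \
--simplify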
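Because --simplify keeps the original names, it is worth checking that none of them exceed the 16-character limit mentioned above. The function below is a small sketch for that check (long_headers is a hypothetical helper name, and the default limit of 16 is taken from the text above rather than verified against NCBI's rules).

```shell
#!/bin/bash

# Print every FASTA identifier that is longer than a given limit
# (default: 16 characters). The identifier is the header text
# after '>' and up to the first whitespace.
long_headers() {
    local fasta=$1
    local limit=${2:-16}
    awk -v limit="$limit" '
        /^>/ {
            id = substr($1, 2)               # drop the leading ">"
            if (length(id) > limit) print id
        }' "$fasta"
}
```

Running `long_headers ergo_cyp_genome.sort.fasta` should print nothing if all contig names are short enough. If it does print names, dropping --simplify and letting funannotate sort rename the contigs is probably the easier route.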
Mask
The third step is to mask your genome assembly. This is analogous to the RepeatMasking step of the Braker2 pipeline. By default, funannotate mask runs simple masking with tantan. It can also act as a wrapper for RepeatModeler and RepeatMasker, or you can use any external program to softmask your assembly. In a softmasked assembly, repeats are written in lowercase letters and all non-repetitive regions in uppercase letters.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 10
source activate funannotate
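# input assembly (original file name; the sorted file name is derived from it)
GENOME='ergo_cyp_genome.fasta'
# keep in sync with the '-pe threaded' request above
THREADS=10
# soft-mask with tantan
funannotate mask \
--input "${GENOME%.fasta}.sort.fasta" \
--out "${GENOME%.fasta}.mask.fasta" \
--method tantan \
--cpus "$THREADS"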
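Since softmasking just means lowercase letters, a quick sanity check is to compute which fraction of the assembly ended up masked. The function below is a rough sketch (masked_fraction is a hypothetical helper, not a funannotate command); it counts lowercase characters in the sequence lines and divides by the total sequence length.

```shell
#!/bin/bash

# Print the fraction of bases that are lowercase (soft-masked)
# in a FASTA file, with four decimals.
# LC_ALL=C makes sure [a-z] only matches lowercase ASCII letters.
masked_fraction() {
    LC_ALL=C awk '
        !/^>/ {
            total  += length($0)             # all bases on this line
            masked += gsub(/[a-z]/, "")      # gsub returns the number of lowercase bases
        }
        END {
            if (total > 0) printf "%.4f\n", masked / total
            else print "0.0000"
        }' "$1"
}
```

For example, `masked_fraction ergo_cyp_genome.mask.fasta` prints a value between 0 and 1. A value of exactly 0 would suggest the masking step did not do anything.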
