Gene prediction with the Funannotate pipeline

Joran Martijn (December 2022)

Funannotate is a genome prediction, annotation, and comparison software package. It was originally written to annotate fungal genomes (small eukaryotic genomes of ~30 Mb), but has evolved over time to accommodate larger genomes.

In my experience, it predicts gene models considerably better than the Braker2 pipeline for Ergobibamus cyprinoides. In addition to gene prediction, it can also facilitate functional annotation (hence the name: FUNctional ANNOTATE).

An additional advantage is that it can prepare all the files necessary for an NCBI GenBank submission.

Official documentation can be found on their ReadTheDocs page.

Clean, sort, mask and train

If you have a genome assembly in plain FASTA format ready, as well as some RNAseq data, you can follow along with the Funannotate tutorial described here.

Here I've adapted those commands so that they work on our cluster, Perun. Note that all the code snippets below are also included in the Gospel Of Andrew.

Clean

The first step is funannotate clean. It attempts to find and delete short contigs that are 'repetitive', meaning that they are already fully represented in a larger contig (≥ 95% sequence similarity and ≥ 95% sequence coverage overlap).

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'

# run funannotate
funannotate clean \
    --input $GENOME \
    --out ${GENOME/fasta/clean.fasta}
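
A quick sanity check after the clean job finishes is to compare the number of contigs before and after cleaning. The following is a minimal sketch, not part of Funannotate: `count_contigs` is a hypothetical helper name introduced here, and the output name is built with `${GENOME%.fasta}.clean.fasta`, a POSIX spelling that yields the same `ergo_cyp_genome.clean.fasta` as the expression in the job script.

```shell
# count_contigs is a hypothetical helper (not a funannotate command):
# it counts FASTA records by counting header lines that start with '>'.
count_contigs() {
    grep -c '^>' "$1"
}

GENOME='ergo_cyp_genome.fasta'
# strip the '.fasta' suffix and append '.clean.fasta'
CLEAN="${GENOME%.fasta}.clean.fasta"
echo "cleaned assembly: $CLEAN"

# only compare when both files exist, so the snippet is safe to run anywhere
if [ -f "$GENOME" ] && [ -f "$CLEAN" ]; then
    echo "contigs before: $(count_contigs "$GENOME")"
    echo "contigs after:  $(count_contigs "$CLEAN")"
fi
```

If the two counts are identical, no contig was judged redundant at the thresholds described above.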

Sort

The second step is funannotate sort. It sorts the contigs in the FASTA file from longest to shortest and renames the contig headers. Apparently NCBI limits FASTA headers to 16 characters, and AUGUSTUS can also have issues with longer contig names. If you do not wish to rename your sequences, you can instead pass --simplify (or -s), which simply splits headers at the first space, so headers that contain no space remain unchanged. The script below uses this option.

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'

## sort by length
funannotate sort \
    --input ${GENOME/fasta/clean.fasta} \
    --out ${GENOME/fasta/sort.fasta} \
    --simplify
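
To illustrate what splitting a header at the first space means, and how one might check header names against the 16-character limit mentioned above, here is a small sketch on a made-up toy FASTA file. It only imitates the effect of `--simplify` with `sed`; it is not how Funannotate implements the option, and the helper names `simplify_headers` and `long_headers` are hypothetical.

```shell
# simplify_headers: hypothetical helper imitating the effect of --simplify
# by cutting each header line at its first space (sequence lines untouched).
simplify_headers() {
    sed '/^>/ s/ .*//' "$1"
}

# long_headers: hypothetical helper that prints headers whose name
# (without the leading '>') is longer than 16 characters.
long_headers() {
    awk '/^>/ && length($0) - 1 > 16' "$1"
}

# toy input, made up for this example
toy=$(mktemp)
simplified=$(mktemp)
printf '>contig_1 length=8 cov=12.5\nACGTACGT\n>a_very_long_contig_name_42\nGGCC\n' > "$toy"

simplify_headers "$toy" > "$simplified"
cat "$simplified"            # first header is now '>contig_1'
long_headers "$simplified"   # prints '>a_very_long_contig_name_42'
rm -f "$toy" "$simplified"
```

Any header reported by the second helper would still be too long after simplification, in which case letting funannotate sort rename the contigs is probably the safer choice.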

Mask

The third step is to mask your genome assembly. This is analogous to the RepeatMasking step of the Braker2 pipeline. By default, funannotate mask runs simple masking with tantan. The command is also a wrapper for RepeatModeler and RepeatMasker, but you can use any external program to soft-mask your assembly. In a soft-masked assembly, repeats are represented by lowercase letters and all non-repetitive regions by uppercase letters.

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 10

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'
THREADS=10

# soft-mask with tantan
funannotate mask \
    --input ${GENOME/fasta/sort.fasta} \
    --out ${GENOME/fasta/mask.fasta} \
    --method tantan \
    --cpus $THREADS
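
Because soft-masking only changes the case of the bases, the masked fraction of an assembly can be estimated by counting lowercase characters in the sequence lines. The sketch below shows one way to do that with `awk`; `masked_fraction` is a hypothetical helper introduced for illustration, not a Funannotate command, and the toy sequences are made up.

```shell
# masked_fraction: hypothetical helper that reports the fraction of
# lowercase (soft-masked) characters among all sequence characters.
masked_fraction() {
    LC_ALL=C awk '
        # sequence lines only: add the line length, then count lowercase letters
        !/^>/ { total += length($0); masked += gsub(/[a-z]/, "") }
        END   { if (total > 0) { printf "%.4f\n", masked / total } else { print "0.0000" } }
    ' "$1"
}

# toy input: 4 lowercase bases out of 12 in total
toy=$(mktemp)
printf '>contig_1\nACGTacgt\n>contig_2\nAAAA\n' > "$toy"
masked_fraction "$toy"   # prints 0.3333
rm -f "$toy"
```

On a real run one would point the helper at the masked assembly, for example `masked_fraction ergo_cyp_genome.mask.fasta`.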
