Functional annotation with the Funannotate pipeline

Joran Martijn (April 2023)

Funannotate is a software package for gene prediction, functional annotation, and genome comparison. It was originally written to annotate fungal genomes (small eukaryotic genomes of roughly 30 Mb), but has evolved over time to accommodate larger genomes.

For doing gene prediction with Funannotate, check this other Wiki entry I wrote.

After finishing up your gene predictions, you may want to figure out what these genes actually encode. This is where functional annotation comes into play. Essentially, functional annotation entails a number of similarity searches (BLAST, HMMER) against various databases (SwissProt, InterPro, Pfam, EggNOG, NCBI RefSeq Genomes / NR, etc.), protein sequence analyses that predict particular properties (e.g. SignalP, TMHMM, Phobius, antiSMASH), and other kinds of analyses (Gene Ontology terms, etc.), to make a best guess at how each of our predicted proteins functions in the cell.

Funannotate wraps up lots of these sequence analyses neatly in a few commands. It also produces GenBank, Sequin and other files that allow you to easily submit your annotations to GenBank.

In my experience it was easiest to run a separate InterProScan search, EggNOG mapping and SignalP prediction, and then point to each of the resulting output files in the final funannotate annotate step.
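
For orientation, the final step could look roughly like the sketch below. It is not a tested recipe: the paths are simply the output locations used in the example scripts on this page, I am assuming emapper's default <prefix>.emapper.annotations file name and SignalP 6's prediction_results.txt, and you should check funannotate annotate --help to confirm that your version accepts these flags and can parse SignalP 6 output. The script only prints the command unless you set RUN=1.

```shell
#!/bin/bash
# Sketch of the final step: hand the three pre-computed result files
# to funannotate annotate. All paths are placeholders based on the
# example scripts on this page; adjust them to your own run.
PREDICT_DIR='test_funannotate_out'
IPRSCAN_XML='results/36_interproscan-5.52-86.0/iprscan.xml'
EGGNOG_ANNOT='emapper_out/ergo_emapper.emapper.annotations'
SIGNALP_TXT='signalp_out/prediction_results.txt'
THREADS=8

# Build the command as an array so that paths with spaces stay intact
cmd=(funannotate annotate
     --input "$PREDICT_DIR"
     --iprscan "$IPRSCAN_XML"
     --eggnog "$EGGNOG_ANNOT"
     --signalp "$SIGNALP_TXT"
     --cpus "$THREADS")

# Dry run by default: print the command; set RUN=1 to execute it
if [[ "${RUN:-0}" == 1 ]]; then
    "${cmd[@]}"
else
    printf '%s ' "${cmd[@]}"
    echo
fi
```

Printing the command first makes it easy to spot a wrong path before the job sits in the queue.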

InterProScan

InterProScan will check which InterPro and/or Pfam domains are present in each of your proteins.

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40

source activate funannotate

IPR_PATH='/scratch2/software/interproscan-5.52-86.0/interproscan.sh'
THREADS=40
OUTXML='results/36_interproscan-5.52-86.0/iprscan.xml'

# run funannotate
## ensure that you point to an interproscan installation 
## outside of any conda environment
funannotate iprscan \
    --input test_funannotate_out \
    --method local \
    --iprscan_path $IPR_PATH \
    --cpus $THREADS \
    --out $OUTXML
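
The comment in the script about pointing to an InterProScan installation outside of any conda environment can be checked before submitting the job. Below is a minimal sketch; check_iprscan_path is a hypothetical helper of my own, not part of funannotate or InterProScan, and it relies on conda setting CONDA_PREFIX to the root of the active environment. It is a naive string comparison, so it will not see through symlinks.

```shell
#!/bin/bash
# Hypothetical helper (not part of funannotate or InterProScan):
# succeeds only if the given interproscan.sh is executable and
# does not live inside the currently active conda environment.
check_iprscan_path() {
    local path="$1"
    if [[ ! -x "$path" ]]; then
        echo "interproscan.sh not found or not executable: $path" >&2
        return 1
    fi
    # CONDA_PREFIX is set by conda to the root of the active environment
    if [[ -n "${CONDA_PREFIX:-}" && "$path" == "$CONDA_PREFIX"/* ]]; then
        echo "interproscan.sh is inside the active conda env: $path" >&2
        return 1
    fi
    return 0
}

IPR_PATH='/scratch2/software/interproscan-5.52-86.0/interproscan.sh'
check_iprscan_path "$IPR_PATH" || echo "WARNING: check IPR_PATH before submitting" >&2
```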

EggNOG mapping

The eggNOG mapping tool (emapper.py) aims to work out which orthologous group (aka gene family) of the eggNOG database each of your predicted proteins belongs to.

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40

source activate eggnog-mapper-2.1.4

# input
EGGNOG_DB='/scratch4/db/eggnog-mapper-2.1.4'
PROTEINS='Ergobibamus_cyprinoides_CL.proteins.fa'
THREADS=40
OUTPUT_DIR='emapper_out'
PREFIX='ergo_emapper'

# emapper.py does not seem to create the output
# directory itself, so create it here if it is missing
mkdir -p "$OUTPUT_DIR"

# run eggnog mapper
emapper.py \
    -i $PROTEINS \
    --data_dir $EGGNOG_DB \
    --output_dir $OUTPUT_DIR \
    --output $PREFIX \
    --cpu $THREADS \
    -m diamond \
    --dbmem

Instead of -m diamond, you may want to consider -m hmmer, which should be a bit more sensitive but typically takes longer to run.
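
After the run it can be useful to see how many proteins actually received an eggNOG hit. The sketch below assumes emapper wrote its default <prefix>.emapper.annotations file into the output directory, with comment and header lines starting with #. The two function names are hypothetical helpers, not part of eggnog-mapper.

```shell
#!/bin/bash
# Hypothetical helpers for a quick coverage check of the emapper output.

# Number of data rows (annotated queries); lines starting with '#'
# are metadata or the column header and are skipped
count_annotated() {
    awk '!/^#/ && NF {n++} END {print n+0}' "$1"
}

# Number of sequences in a FASTA file
count_proteins() {
    awk '/^>/ {n++} END {print n+0}' "$1"
}

ANNOT='emapper_out/ergo_emapper.emapper.annotations'
PROTEINS='Ergobibamus_cyprinoides_CL.proteins.fa'

# Only report if both files exist and are non-empty
if [[ -s "$ANNOT" && -s "$PROTEINS" ]]; then
    echo "$(count_annotated "$ANNOT") of $(count_proteins "$PROTEINS") proteins have an eggNOG annotation"
fi
```

If only a small fraction of proteins gets a hit, that may be a reason to retry with the more sensitive search mode.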

SignalP

The code below runs SignalP v6.0, which uses a protein language model to find proteins with N-terminal signal peptides, which are thus destined for the endoplasmic reticulum or for secretion outside the cell.

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 8

source activate signalp6-fast

PROTEINS='Ergobibamus_cyprinoides_CL.proteins.fa'
OUTDIR='signalp_out'
THREADS=8

# run signalp6
signalp6 \
    --fastafile $PROTEINS \
    --output_dir $OUTDIR \
    --format all \
    --organism eukarya \
    --mode fast \
    --write_procs $THREADS
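
To get a quick overview of the result, you can count the proteins that SignalP flagged. This is a sketch that assumes the tab-separated summary file is called prediction_results.txt, that its comment and header lines start with #, and that the second column holds the prediction (SP or OTHER for eukaryotes). Check your own output file before relying on it. The function names are hypothetical helpers.

```shell
#!/bin/bash
# Hypothetical helpers for summarising the SignalP 6 summary table.

# Count proteins whose prediction (column 2) is SP
count_signal_peptides() {
    awk -F'\t' '!/^#/ && $2 == "SP" {n++} END {print n+0}' "$1"
}

# Print the identifiers (column 1) of those proteins
list_signal_peptides() {
    awk -F'\t' '!/^#/ && $2 == "SP" {print $1}' "$1"
}

RESULTS='signalp_out/prediction_results.txt'

# Only report if the results file exists and is non-empty
if [[ -s "$RESULTS" ]]; then
    echo "Proteins with a predicted signal peptide: $(count_signal_peptides "$RESULTS")"
fi
```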
