Functional annotation with the Funannotate pipeline
Joran Martijn (April 2023) modified December 2024 by Kathy Dunn
Funannotate is a gene prediction, functional annotation, and genome comparison software package. It was originally written to annotate fungal genomes (small eukaryotic genomes of roughly 30 Mb), but has evolved over time to accommodate larger genomes.
For doing gene prediction with Funannotate, check this other Wiki entry I wrote.
After finishing your gene predictions, you may want to figure out what these genes actually encode. This is where functional annotation comes into play. Essentially, functional annotation combines similarity searches (BLAST, HMMER) against various databases (SwissProt, InterPro, Pfam, EggNOG, NCBI RefSeq Genomes / NR, etc.), protein sequence analyses that predict particular properties (e.g. SignalP, TMHMM, Phobius, antiSMASH), and other kinds of analyses (e.g. Gene Ontology term assignment) to make a best guess at how each of our predicted proteins functions in the cell.
Funannotate wraps up lots of these sequence analyses neatly in a few commands. It also produces GenBank, Sequin and other files that allow you to easily submit your annotations to GenBank.
In my experience it was easiest to do a separate InterProScan search, EggNOG mapping and SignalP prediction, and then point to each of the resulting output files in the final funannotate annotate step.
Important note!!!
Before proceeding you need to check your gff3 file for errors by running validate_gene_models_in_gff3.py
This script will locate errors in your gff3 file that often occur due to manual editing (premature stop codons, incorrect exon numbering, missing start and stop codons, start or stop location not matching exon location, etc.). An incorrect phase designation for an exon can itself introduce premature stop codons, so check the phase column when looking for the cause of a premature stop.
Pre-step to prepare files
For the results from InterProScan, EggNOG mapper and SignalP to be integrated into the funannotate annotate output, you first need to prepare the gff3 file and the protein sequences using the two funannotate utility commands below:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
source activate funannotate
funannotate util gff-rename \
--gff3 {current.gff3} \
--fasta {genome_file_masked.fasta} \
--locus_tag {NCBI assigned locus tag or similar} \
--out {renamed.gff3}
funannotate util gff2prot \
--gff3 {renamed.gff3} \
--fasta {genome_file_masked.fasta} \
--no_stop \
> {protein_file.faa}
You will use the protein_file.faa generated here as the input for the scripts below (InterProScan --input, EggNOG mapper -i, and SignalP --fastafile), and the renamed.gff3 file in the funannotate annotate script (see special note).
If you skip the above steps your results from InterProScan, EggNOG mapping and SignalP will not appear in the final output from funannotate annotate!!
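Before feeding protein_file.faa to the three tools, it can be worth a quick sanity check. The sketch below is only an illustration: the helper names count_proteins and list_internal_stops are hypothetical and not part of funannotate. It relies on the fact that gff2prot was run with --no_stop, so any '*' left in a sequence should be an internal (premature) stop.

```shell
#!/bin/bash
# Hypothetical helpers (not part of funannotate) for checking a protein FASTA.

# Print the number of FASTA records in the file.
count_proteins() {
    grep -c '^>' "$1"
}

# Print the header of every record whose sequence contains a '*'.
# Assumption: terminal stops were already stripped (gff2prot --no_stop),
# so any remaining '*' marks an internal stop codon.
list_internal_stops() {
    awk '/^>/ { header = $0; next }
         /\*/ { if (!(header in seen)) { print header; seen[header] = 1 } }' "$1"
}

# Example usage:
# count_proteins protein_file.faa
# list_internal_stops protein_file.faa
```

If list_internal_stops prints any headers, those gene models are good candidates to re-inspect with validate_gene_models_in_gff3.py before continuing.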
InterProScan
InterProScan will check which InterPro and/or Pfam domains are present in each of your proteins.
NOTE: If your genome required edits, such that you did not simply run funannotate to get your gene models, you can substitute the funannotate folder (--input test_funannotate_out) below with the protein FASTA file generated above (--input protein_file.faa).
Funannotate includes a wrapper for executing the interproscan.sh script:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40
source activate funannotate
IPR_PATH='/scratch2/software/interproscan-5.52-86.0/interproscan.sh'
THREADS=40
OUTXML='results/36_interproscan-5.52-86.0/iprscan.xml'
# run funannotate
## ensure that you point to an interproscan installation
## outside of any conda environment
funannotate iprscan \
--input test_funannotate_out \
--method local \
--iprscan_path $IPR_PATH \
--cpus $THREADS \
--out $OUTXML
NOTE: Your --input can be either a pre-existing funannotate output directory (for instance if you run this step after you already ran another funannotate command earlier) or a plain protein FASTA file.
NOTE: When you specify your $OUTXML in the bash script above, make sure that any directory defined in the path already exists. funannotate iprscan will not automatically generate output directories if they don't exist yet and will not finish properly.
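One way to guard against the missing-directory problem is to create the parent directory of the output file right before calling funannotate iprscan. This is a minimal sketch; ensure_parent_dir is a hypothetical helper name, not a funannotate command.

```shell
#!/bin/bash
# Hypothetical helper (not part of funannotate): create the parent
# directory of an output file path if it does not exist yet.
ensure_parent_dir() {
    local outfile="$1"
    # mkdir -p creates intermediate directories and is a no-op if they exist
    mkdir -p "$(dirname "$outfile")"
}

# Example with the OUTXML path used above:
# ensure_parent_dir 'results/36_interproscan-5.52-86.0/iprscan.xml'
```

In the job script above, the call would go between the OUTXML assignment and the funannotate iprscan command.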
EggNOG mapping
The eggNOG mapping algorithm (emapper.py) aims to find out which orthologous group (aka gene family) of the eggNOG database each of your predicted proteins belongs to.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40
source activate eggnog-mapper-2.1.4
# input
EGGNOG_DB='/scratch4/db/eggnog-mapper-2.1.4'
PROTEINS='Ergobibamus_cyprinoides_CL.proteins.fa'
THREADS=40
OUTPUT_DIR='emapper_out'
PREFIX='ergo_emapper'
# the mapper doesn't create an outdir
# automatically? create one here
[[ ! -d "$OUTPUT_DIR" ]] && mkdir -p "$OUTPUT_DIR"
# run eggnog mapper
emapper.py \
-i $PROTEINS \
--data_dir $EGGNOG_DB \
--output_dir $OUTPUT_DIR \
--output $PREFIX \
--cpu $THREADS \
-m diamond \
--dbmem
Instead of -m diamond you may want to consider using -m hmmer, which should be a bit more sensitive.
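The file that funannotate annotate needs later is the '.annotations' file, which with the settings above ends up at emapper_out/ergo_emapper.emapper.annotations. The sketch below rebuilds that path from the output directory and prefix and counts the annotated proteins. Both function names are hypothetical, and the count assumes that every header and footer line in the annotations file starts with '#'.

```shell
#!/bin/bash
# Hypothetical helpers, not part of eggnog-mapper or funannotate.

# Build the path of the annotations file from the output directory
# and the prefix passed to emapper.py (--output_dir and --output).
emapper_annotations_path() {
    local outdir="$1" prefix="$2"
    printf '%s/%s.emapper.annotations\n' "$outdir" "$prefix"
}

# Count annotated queries: every line that does not start with '#'.
# Assumption: header and footer lines are all '#'-prefixed.
count_annotated() {
    grep -vc '^#' "$1"
}

# Example usage:
# ANNOT=$(emapper_annotations_path emapper_out ergo_emapper)
# count_annotated "$ANNOT"
```

Comparing this count with the number of proteins in the input FASTA gives a rough idea of how many proteins received an eggNOG assignment.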
SignalP
The code below runs SignalP v6.0, which predicts which proteins carry an N-terminal signal peptide and are thus destined for the endoplasmic reticulum or for secretion outside the cell.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 8
source activate signalp6-fast
PROTEINS='Ergobibamus_cyprinoides_CL.proteins.fa'
OUTDIR='signalp_out'
THREADS=8
# run signalp6
signalp6 \
--fastafile $PROTEINS \
--output_dir $OUTDIR \
--format all \
--organism eukarya \
--mode fast \
--write_procs $THREADS
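To get a quick overview of the run, you can tally the prediction labels in prediction_results.txt. The sketch below is hedged: summarize_signalp is a hypothetical helper, and it assumes the file is tab-delimited, has '#'-prefixed header lines, and stores the predicted label (e.g. SP or OTHER) in the second column. Check your own file before relying on it.

```shell
#!/bin/bash
# Hypothetical helper (not part of SignalP): tally prediction labels.
# Assumptions: tab-delimited file, header lines start with '#',
# and column 2 holds the predicted label.
summarize_signalp() {
    awk -F'\t' '!/^#/ && NF >= 2 { count[$2]++ }
        END { for (label in count) print label "\t" count[label] }' "$1" | sort
}

# Example usage:
# summarize_signalp signalp_out/prediction_results.txt
```

The output is one line per label, with the label and its count separated by a tab.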
Funannotate annotate
Finally, we can integrate the analyses above with the other annotation steps in a single funannotate command.
If you have used funannotate to predict your genes, you should have a funannotate directory where all related output files are stored. The first script below passes that directory via --input:
Special note: If you have not generated your gene models solely with funannotate, then rather than pointing at the funannotate directory you supply the gff3 (--gff) and genome FASTA (--fasta) files in place of --input, as in the second example script.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40
source activate funannotate
FUN_DIR='funannotate_out'
## --eggnog asks for the '.annotations' file
EGGNOG_RESULTS='emapper_out/ergo_emapper.emapper.annotations'
SIGNALP_RESULTS='results/35_signalp-6.0g/prediction_results.txt'
IPRSCAN_RESULTS='results/36_interproscan-5.52-86.0/iprscan.xml'
SBT_FILE='misc/template.sbt' # necessary for NCBI GenBank submission
THREADS=40
# run funannotate
funannotate annotate \
--input $FUN_DIR \
--eggnog $EGGNOG_RESULTS \
--signalp $SIGNALP_RESULTS \
--iprscan $IPRSCAN_RESULTS \
--busco_db eukaryota \
--cpus $THREADS \
--sbt $SBT_FILE
If your gene models have not been called solely by funannotate, you can pass your renamed gff3 and genome FASTA files (--gff and --fasta) instead of --input:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q 256G-batch
#$ -pe threaded 40
source activate funannotate
## --eggnog asks for the '.annotations' file
EGGNOG_RESULTS='functional_annotation/emapper_out/BlastoST2_emapper.emapper.annotations'
SIGNALP_RESULTS='functional_annotation/signalp_out/prediction_results.txt'
IPRSCAN_RESULTS='functional_annotation/interpro_results/iprscan_results.xml'
THREADS=40
funannotate annotate \
--gff renamed.gff3 \
--fasta genome_masked.fasta \
--species Blastocystis_ST2 \
--out functional_annotation \
--eggnog $EGGNOG_RESULTS \
--signalp $SIGNALP_RESULTS \
--iprscan $IPRSCAN_RESULTS \
--busco_db eukaryota \
--cpus $THREADS
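Because funannotate annotate takes several result files produced by earlier jobs, a pre-flight check can save a wasted queue slot. The following is a minimal sketch; check_inputs is a hypothetical helper, not a funannotate feature. It reports every file that is missing or empty and returns a non-zero status if any are.

```shell
#!/bin/bash
# Hypothetical helper (not part of funannotate): verify that every
# given file exists and is non-empty before submitting the job.
check_inputs() {
    local missing=0 f
    for f in "$@"; do
        if [[ ! -s "$f" ]]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage inside either annotate script, before funannotate annotate:
# check_inputs "$EGGNOG_RESULTS" "$SIGNALP_RESULTS" "$IPRSCAN_RESULTS" || exit 1
```

Using -s rather than -e also catches result files that were created but left empty by a job that crashed.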
