Gene prediction with the Funannotate pipeline

Joran Martijn (December 2022)

Funannotate is a genome prediction, annotation, and comparison software package. It was originally written to annotate fungal genomes (small eukaryotes with ~30 Mb genomes), but has evolved over time to accommodate larger genomes.

In my experience, it predicts gene models for Ergobibamus cyprinoides considerably better than the Braker2 pipeline does. In addition to gene prediction, it can also facilitate functional annotation (hence the name FUNctional - ANNOTATE).

An additional advantage is that it can prepare all the files necessary for an NCBI GenBank submission.

Official documentation can be found on their ReadTheDocs page.

Clean, sort, mask and train

If you have a genome assembly in plain FASTA format ready, as well as some RNAseq data, you can follow along with the Funannotate tutorial described here.

Here I've adapted those commands so they work with our cluster Perun. Note that all the code snippets below are also available in the Gospel Of Andrew.

Clean

The first step is funannotate clean. It attempts to find and delete short contigs that are 'repetitive', that is, already fully represented in a larger contig (≥ 95% sequence similarity and ≥ 95% sequence coverage overlap).

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'

# run funannotate
funannotate clean \
    --input $GENOME \
    --out ${GENOME/fasta/clean.fasta}
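
The output name above is built with bash pattern substitution, ${VAR/pattern/replacement}, which replaces the first match of the pattern in the variable's value. A minimal sketch of what the expressions used on this page expand to (the CLEAN, SORTED and MASKED variable names are only for illustration):

```shell
#!/bin/bash
GENOME='ergo_cyp_genome.fasta'

# ${VAR/pattern/replacement} replaces the FIRST match of 'pattern' in $VAR
CLEAN="${GENOME/fasta/clean.fasta}"
SORTED="${GENOME/fasta/sort.fasta}"
MASKED="${GENOME/fasta/mask.fasta}"

echo "$CLEAN"    # ergo_cyp_genome.clean.fasta
echo "$SORTED"   # ergo_cyp_genome.sort.fasta
echo "$MASKED"   # ergo_cyp_genome.mask.fasta
```

Because only the first match is replaced, a file name that contains "fasta" somewhere before the extension would be renamed in the wrong place. In that case something like "${GENOME%.fasta}.clean.fasta", which strips the suffix explicitly, is probably the safer choice.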

Sort

The second step is funannotate sort. It sorts your contigs from longest to shortest within the FASTA file and renames the contig headers. Apparently NCBI limits the number of characters for FASTA headers to 16, and AUGUSTUS can also have issues with longer contig names. If you do not wish to rename your sequences, you can alternatively pass --simplify or -s, which simply splits headers at the first space. (So if your headers do not have a space to begin with, they will remain the same.)

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 1

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'

## sort by length
funannotate sort \
    --input ${GENOME/fasta/clean.fasta} \
    --out ${GENOME/fasta/sort.fasta} \
    --simplify
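
Since --simplify only splits headers at the first space, it may be worth checking beforehand whether the resulting names stay within the 16-character limit mentioned above. The sketch below does this with awk on a toy FASTA file; the headers and sequences are invented for illustration, and awk splits on any whitespace rather than strictly on the first space.

```shell
#!/bin/bash
# Toy FASTA with made-up headers, only for illustration
FASTA="$(mktemp)"
cat > "$FASTA" <<'EOF'
>contig_1 length=1200 cov=35.2
ACGTACGTACGT
>a_very_long_contig_name_from_the_assembler
ACGTAC
EOF

# What is left of each header after splitting at the first whitespace
SIMPLIFIED="$(awk '/^>/ {print substr($1, 2)}' "$FASTA")"

# Headers that are still longer than 16 characters after that split
TOO_LONG="$(awk '/^>/ {id = substr($1, 2); if (length(id) > 16) print id}' "$FASTA")"

echo "$SIMPLIFIED"
echo "Too long: $TOO_LONG"
rm -f "$FASTA"
```

If the second awk command prints anything for your own assembly, letting funannotate sort rename the contigs (that is, leaving out --simplify) is likely the easier route.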

Mask

The third step is to mask your genome assembly. This is analogous to the RepeatMasking step of the Braker2 pipeline. By default, funannotate mask runs simple masking with tantan. The script can also act as a wrapper for RepeatModeler and RepeatMasker, and you can use any external program to softmask your assembly instead. In a softmasked assembly, repeats are represented by lowercase letters and all non-repetitive regions by uppercase letters.

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 10

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'
THREADS=10

# soft-mask with tantan
funannotate mask \
    --input ${GENOME/fasta/sort.fasta} \
    --out ${GENOME/fasta/mask.fasta} \
    --method tantan \
    --cpus $THREADS
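
Because softmasking is just a matter of letter case, the masked fraction of an assembly can be estimated by counting lowercase bases. A minimal sketch with awk on a toy FASTA file (the sequences are invented; for a real run you would point awk at the masked assembly instead):

```shell
#!/bin/bash
# Toy softmasked FASTA (made-up sequences), only for illustration
FASTA="$(mktemp)"
cat > "$FASTA" <<'EOF'
>scaffold_1
ACGTacgtacgtACGT
>scaffold_2
ttttGGGG
EOF

# Count lowercase (masked) bases and all bases on the sequence lines
MASKED_PCT="$(awk '
    !/^>/ {
        total += length($0)
        masked += gsub(/[acgtn]/, "")   # gsub returns the number of replacements
    }
    END { printf "%.1f", 100 * masked / total }
' "$FASTA")"

echo "Softmasked: ${MASKED_PCT}%"
rm -f "$FASTA"
```

A masked fraction close to zero would suggest that the masking step did not do what you expected.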

Train

In the funannotate train step, you map your RNAseq data against the cleaned, sorted and masked assembly with HISAT2, do a genome-guided transcriptome assembly with Trinity and another transcriptome assembly with PASA, all in a single step:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -q 768G
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'
RNASEQ_PRD_FW='eb_rna_fw.prd.fastq.gz'
RNASEQ_PRD_RV='eb_rna_rv.prd.fastq.gz'
GENUS='Ergobibamus'
SPECIES='cyprinoides'
STRAIN='CL'
THREADS=40

## align rnaseq data, run trinity and pasa
funannotate train \
    --input ${GENOME/fasta/mask.fasta} \
    --out funannotate_out \
    --left $RNASEQ_PRD_FW \
    --right $RNASEQ_PRD_RV \
    --stranded RF --jaccard_clip \
    --max_intronlen 3000 \
    --species "$GENUS $SPECIES" \
    --strain $STRAIN \
    --cpus $THREADS \
    --no_normalize_reads \
    --no_trimmomatic \
    --memory 100G

This step requires quite a bit of memory, which is why we ask for the 768G RAM node of Perun. The above snippet assumes that the RNAseq reads have already been trimmed with Trimmomatic. If they are still raw reads, funannotate should be able to do the Trimmomatic cleaning for you (though I have not tested this). We are also running with --jaccard_clip, which is apparently used when you expect a gene-dense genome (like that of yeast, for example).
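
One way to handle both the trimmed and the raw case in a single script is to collect the arguments in a bash array and append --no_trimmomatic only when it applies. This is an untested sketch: the READS_TRIMMED switch and the raw read file names are assumptions made up for this example, and the command is only printed, not run.

```shell
#!/bin/bash
# READS_TRIMMED is a hypothetical switch introduced for this sketch
READS_TRIMMED='no'
THREADS=40

# Collect the arguments in an array so optional flags can be appended.
# The read file names are placeholders for raw, untrimmed reads.
TRAIN_ARGS=(
    --input ergo_cyp_genome.mask.fasta
    --out funannotate_out
    --left eb_rna_fw.fastq.gz
    --right eb_rna_rv.fastq.gz
    --cpus "$THREADS"
)

# Skip funannotate's own trimming only when the reads are already trimmed
if [ "$READS_TRIMMED" = 'yes' ]; then
    TRAIN_ARGS+=(--no_trimmomatic)
fi

# Print the command instead of running it; drop 'echo' to actually run it
echo funannotate train "${TRAIN_ARGS[@]}"
```

The remaining options from the job script above (--stranded, --species and so on) can be added to the array in the same way.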

NOTE also that this step generates the `funannotate_out` output directory, which can be used as an input argument in future funannotate jobs.

Predict

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40

source activate funannotate

# input
GENOME='ergo_cyp_genome.fasta'
GENUS='Ergobibamus'
SPECIES='cyprinoides'
STRAIN='CL'
THREADS=40

export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/

# run funannotate
funannotate predict \
    --input ${GENOME/fasta/mask.fasta} \
    --out funannotate_out \
    --species "$GENUS $SPECIES" \
    --strain $STRAIN \
    --cpus $THREADS

Many of the required inputs do not have to be explicitly specified, since they have been generated in the previous funannotate train step and they are available in the funannotate_out directory.

Funannotate (according to log files) uses AUGUSTUS, CodingQuarry, GeneMark-ES, GlimmerHMM and SNAP as main gene model predictors. Note that this is more than used by Braker2, which mainly uses AUGUSTUS and GeneMark. This may be a reason why the Funannotate prediction pipeline seems to do better than Braker2 for Ergobibamus.

Each program uses a different source for training its algorithms. Apart from GeneMark-ES, which trains itself, these sources were generated in the previous step.

  Algorithm      Training method
  AUGUSTUS       PASA transcript-to-genome alignments
  CodingQuarry   RNAseq-vs-genome sorted BAM file and/or StringTie alignments
  GeneMark-ES    self-training
  GlimmerHMM     PASA transcript-to-genome alignments
  SNAP           PASA transcript-to-genome alignments

As with the Braker2 pipeline, funannotate uses EVidenceModeler (EVM) to compile the best overall gene models from the above set of gene predictions. For Ergobibamus the weights looked something like this:

  Source         Weight   Count
  Augustus       1        4462
  Augustus HiQ   2        660
  CodingQuarry   2        6325
  GeneMark       1        5433
  GlimmerHMM     1        5830
  pasa           6        6470
  snap           1        3924
  Total          -        33104
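
As a quick sanity check, the per-source counts in the table above can be summed to confirm the reported total. A small sketch with awk, with the table rows copied into a shell variable (the space in "Augustus HiQ" is replaced by an underscore so that each row has exactly three fields):

```shell
#!/bin/bash
# The EVM input table from above, one "source weight count" row per line
TABLE='Augustus 1 4462
Augustus_HiQ 2 660
CodingQuarry 2 6325
GeneMark 1 5433
GlimmerHMM 1 5830
pasa 6 6470
snap 1 3924'

# Sum the third column (number of gene models per source)
TOTAL="$(echo "$TABLE" | awk '{sum += $3} END {print sum}')"
echo "Total gene models passed to EVM: $TOTAL"   # 33104
```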

In some of the final steps, funannotate calls upon tRNAscan (I'm guessing the SE version) to predict tRNA genes. Then at the very end, if you have specified it to, it can create GenBank format-compliant files for you to submit to GenBank. Awesome!

If you intend to curate the gene predictions, it is most practical to do this after the funannotate update step (see below).

Update

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m bea
#$ -M joran.martijn@dal.ca
#$ -pe threaded 40

source activate funannotate

# input
FUNDIR='funannotate_out'
THREADS=40
LOCUS_TAG='PYV62'
SBT='template.sbt'
ACCESSION='JARDW000000000'
MEMORY='240G'

# run funannotate
funannotate update \
    --input $FUNDIR \
    --cpus $THREADS \
    --name $LOCUS_TAG \
    --sbt $SBT \
    --SeqCenter RogerLab \
    --SeqAccession $ACCESSION \
    --no_trimmomatic \
    --memory $MEMORY

In brief, funannotate update uses the RNAseq alignments (already present in the funannotate_out directory) to infer the start and end locations of the 5' and 3' untranslated regions (UTRs) associated with each gene feature predicted by funannotate predict.

It will also attempt to fix certain gene models if they are in strong disagreement with the RNAseq data.

You can also optionally provide the locus tag, an SBT file, and a WGS accession number to make your final files ready for GenBank submission. To get these, you need to start a genome submission at the Genome Submission Portal. You'll have to make an account with NCBI to do this. Perhaps the most straightforward way is to create an account using your ORCID. If you don't have an ORCID (which stands for Open Researcher and Contributor ID) yet, you can create one here. More and more journals, as well as BioRxiv and other institutions, are making use of ORCID, so it is probably a good idea to get one anyway.

When you are in the Submission Portal, you can start a new submission. You will be asked to fill in several forms, and perhaps create a new BioProject and/or BioSample along the way. It's kind of annoying, but if you intend to publish your genome you'll need to put it on GenBank and this is the most straightforward way to do it.

Shortly after your initial submission you'll receive a locus tag and a WGS accession number. You can create the SBT file by going to Templates on the top right of the submission portal website (in between Manage data and My profile).
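
Before submitting the job, a quick check of the inputs can save you a wait in the queue. The sketch below uses a small helper function, check_update_inputs, which is hypothetical and not part of funannotate; it only tests that the output directory exists, that the SBT file is present and non-empty, and that a locus tag was given.

```shell
#!/bin/bash
# check_update_inputs is a hypothetical helper, not part of funannotate
check_update_inputs() {
    local fundir="$1" sbt="$2" locus_tag="$3"
    [ -d "$fundir" ]    || { echo "missing directory: $fundir" >&2; return 1; }
    [ -s "$sbt" ]       || { echo "missing or empty SBT file: $sbt" >&2; return 1; }
    [ -n "$locus_tag" ] || { echo "locus tag is empty" >&2; return 1; }
}

# Values taken from the update script above
if check_update_inputs 'funannotate_out' 'template.sbt' 'PYV62'; then
    echo 'inputs look fine'
else
    echo 'fix the inputs before submitting the job'
fi
```

It does not validate the contents of the SBT file or the format of the accession number; those still have to match what NCBI sent you.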

Now you should be able to run funannotate update!
