Differences

This shows you the differences between two versions of the page.

--- gene_prediction_with_funannotate [2022/12/14 13:05] – [Clean] 134.190.232.140
+++ gene_prediction_with_funannotate [2026/02/26 12:11] (current) – 129.173.242.70
@@ Line 1: / Line 1: @@
 ====== Gene prediction with the Funannotate pipeline ======
-Joran Martijn (December 2022)
+Created by Joran Martijn in December 2022
+Updated by Jason Shao on February 26th, 2026
 Funannotate is a genome prediction, annotation, and comparison software package. It was originally written to annotate fungal genomes (small eukaryotes ~ 30 Mb genomes), but has evolved over time to accommodate larger genomes.
-In my experience it seems to do quite a lot better in predicting gene models than the Braker2 pipeline with //Ergobibamus cyprionides// . In addition to gene prediction, it can also facilitate functional annotation (hence the name FUNctional - ANNOTATE)
+In my experience it predicts gene models quite a lot better than the Braker2 pipeline for //Ergobibamus cyprinoides//. In addition to gene prediction, it can also facilitate functional annotation (hence the name FUNctional ANNOTATE, though the FUN may also refer to FUNgi, its original target clade).
 An additional advantage is that it has the capacity to prepare all the files necessary for a NCBI GenBank submission.
@@ Line 40: / Line 42: @@
 </code>
+==== Sort ====
+The second step is ''funannotate sort''. It sorts the contigs in the FASTA file from longest to shortest and renames the contig headers. Apparently NCBI limits FASTA headers to 16 characters, and AUGUSTUS can also have issues with longer contig names. If you do not wish to rename your sequences, you can instead pass ''--simplify'' (or ''-s''), which simply splits headers at the first space. (So if your headers do not contain a space to begin with, they will remain the same.)
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+#$ -cwd
+#$ -m bea
+#$ -M joran.martijn@dal.ca
+#$ -pe threaded 1
+source activate funannotate
+# input
+GENOME='ergo_cyp_genome.fasta'
+## sort by length
+funannotate sort \
+    --input ${GENOME/fasta/clean.fasta} \
+    --out ${GENOME/fasta/sort.fasta} \
+    --simplify
+</code>
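+To see whether renaming is needed at all, you can look for sequence IDs longer than the 16-character limit mentioned above. The following is a minimal sketch and not part of funannotate: the function name ''long_headers'' and its default limit are assumptions for illustration, and the ID is taken to be everything between ''>'' and the first whitespace.

```shell
#!/bin/bash
# Hypothetical helper (not a funannotate command): print every FASTA
# sequence ID that is longer than MAXLEN characters (default 16).
long_headers() {
    local fasta="$1" maxlen="${2:-16}"
    awk -v max="$maxlen" '
        /^>/ {
            id = substr($1, 2)              # drop ">" and keep text up to the first whitespace
            if (length(id) > max) print id
        }' "$fasta"
}
```

+For example, ''long_headers ergo_cyp_genome.sort.fasta'' prints nothing when every ID fits within the limit.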
+==== Mask ====
+The third step is to mask your genome assembly. This is analogous to the RepeatMasking step of the Braker2 pipeline. By default, ''funannotate mask'' runs simple masking with ''tantan''. The script is also a wrapper for RepeatModeler and RepeatMasker; however, you can use any external program to softmask your assembly. In a softmasked assembly, repeats are written in lowercase letters and all non-repetitive regions in uppercase letters.
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+#$ -cwd
+#$ -m bea
+#$ -M joran.martijn@dal.ca
+#$ -pe threaded 10
+source activate funannotate
+# input
+GENOME='ergo_cyp_genome.fasta'
+THREADS=10
+# soft-mask with tantan
+funannotate mask \
+    --input ${GENOME/fasta/sort.fasta} \
+    --out ${GENOME/fasta/mask.fasta} \
+    --method tantan \
+    --cpus $THREADS
+</code>
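+Because softmasking only changes the case of the bases, a quick sanity check is to count how much of the masked assembly is lowercase. This is a minimal sketch assuming a plain (uncompressed) FASTA file; ''softmasked_fraction'' is a hypothetical helper name, not a funannotate command.

```shell
#!/bin/bash
# Hypothetical helper: report the fraction of bases that are lowercase
# (i.e. soft-masked) in a FASTA file. Header lines are skipped.
softmasked_fraction() {
    LC_ALL=C awk '
        /^>/ { next }
        {
            total  += length($0)            # count all bases first
            masked += gsub(/[a-z]/, "")     # gsub returns the number of replacements
        }
        END {
            if (total == 0) { print "0.0000"; exit }
            printf "%.4f\n", masked / total
        }' "$1"
}
```

+For example, ''softmasked_fraction ergo_cyp_genome.mask.fasta'' prints a value between 0 and 1. A value of 0 would suggest that nothing was masked.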
+==== Train ====
+In this ''funannotate train'' step, you will map your RNAseq data against the sorted, cleaned and masked assembly with **Hisat2**, do a genome-guided transcriptome assembly with **Trinity** and another transcriptome assembly with **PASA**, all in a single step:
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+#$ -cwd
+#$ -m bea
+#$ -q 768G
+#$ -M joran.martijn@dal.ca
+#$ -pe threaded 40
+source activate funannotate
+# input
+GENOME='ergo_cyp_genome.fasta'
+RNASEQ_PRD_FW='eb_rna_fw.prd.fastq.gz'
+RNASEQ_PRD_RV='eb_rna_rv.prd.fastq.gz'
+GENUS='Ergobibamus'
+SPECIES='cyprinoides'
+STRAIN='CL'
+THREADS=40
+## align rnaseq data, run trinity and pasa
+funannotate train \
+    --input ${GENOME/fasta/mask.fasta} \
+    --out funannotate_out \
+    --left $RNASEQ_PRD_FW \
+    --right $RNASEQ_PRD_RV \
+    --stranded RF --jaccard_clip \
+    --max_intronlen 3000 \
+    --species "$GENUS $SPECIES" \
+    --strain $STRAIN \
+    --cpus $THREADS  \
+    --no_normalize_reads \
+    --no_trimmomatic \
+    --memory 100G
+</code>
+This step requires quite a bit of memory, hence we are asking for the 768G RAM node of Perun. The snippet above assumes that the RNAseq reads have already been trimmed with **Trimmomatic**. If they are still raw reads, funannotate should be able to do the Trimmomatic cleaning for you (though I have not tested this). We are also running with ''--jaccard_clip'', which is apparently meant for gene-dense genomes (like that of yeast, for example).
+NOTE also that this step generates the ''funannotate_out'' output directory, which can be passed as an input argument to later funannotate jobs.
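+The scripts on this page derive all intermediate file names from the single ''GENOME'' variable using bash pattern substitution. The sketch below only illustrates how those expressions expand; the ''BASE'' variable and the suffix-stripping variant at the end are my own suggestion, not something funannotate requires.

```shell
#!/bin/bash
# ${VAR/pattern/replacement} replaces the FIRST match of pattern in $VAR.
GENOME='ergo_cyp_genome.fasta'

CLEAN=${GENOME/fasta/clean.fasta}   # ergo_cyp_genome.clean.fasta
SORTED=${GENOME/fasta/sort.fasta}   # ergo_cyp_genome.sort.fasta
MASKED=${GENOME/fasta/mask.fasta}   # ergo_cyp_genome.mask.fasta

printf '%s\n' "$CLEAN" "$SORTED" "$MASKED"

# Assumption: stripping the suffix is safer when "fasta" could also occur
# earlier in the file name, because it only touches the extension.
BASE=${GENOME%.fasta}
echo "${BASE}.mask.fasta"
```

+Both forms give the same result for the file name used here. They only differ when the string ''fasta'' appears more than once in the name.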
+With funannotate 1.8.17, an esoteric error might occur at the PASA step. If it does, check:
+''funannotate_out/pasa-transdecoder.log''
+<code>
+...
+CMD: cdna_alignment_orf_to_genome_orf.pl Blastocystis_ST2_pasa.assemblies.fasta.transdecoder.gff3 Blastocystis_ST2_pasa.pasa_assemblies.gff3 Blastocystis_ST2_pasa.assemblies.fasta > Blastocystis_ST2_pasa.assemblies.fasta.transdecoder.genome.gff3
+sh: 1: cdna_alignment_orf_to_genome_orf.pl: not found
+Error, cmd: cdna_alignment_orf_to_genome_orf.pl Blastocystis_ST2_pasa.assemblies.fasta.transdecoder.gff3 Blastocystis_ST2_pasa.pasa_assemblies.gff3 Blastocystis_ST2_pasa.assemblies.fasta > Blastocystis_ST2_pasa.assemblies.fasta.transdecoder.genome.gff3 died with ret 32512 at /home/jasons/.conda/envs/funannotate_jds/opt/pasa-2.5.3/scripts/pasa_asmbls_to_training_set.dbi line 150.
+</code>
+However, ''cdna_alignment_orf_to_genome_orf.pl'' is indeed shipped with 1.8.17 (twice, no less!); the shell just cannot find it on the ''PATH''.
+A simple fix is to include this ''export'' statement in your submission script:
+<code>
+export PATH="$CONDA_PREFIX/opt/transdecoder/util:$PATH"
+</code>
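+To fail early instead of deep inside the PASA step, the submission script can check that the helper script is actually visible after the ''export''. This is only a sketch: ''require_on_path'' is a hypothetical function name, and it relies on nothing more than the standard shell built-in ''command -v''.

```shell
#!/bin/bash
# Hypothetical guard: return non-zero and name the first tool
# that cannot be found on PATH.
require_on_path() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool" >&2
            return 1
        fi
    done
}
```

+In the submission script you could then call ''require_on_path cdna_alignment_orf_to_genome_orf.pl || exit 1'' right after the ''export'' line and before ''funannotate train''.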
+==== Predict ====
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+#$ -cwd
+#$ -m bea
+#$ -M joran.martijn@dal.ca
+#$ -pe threaded 40
+source activate funannotate
+# input
+GENOME='ergo_cyp_genome.fasta'
+GENUS='Ergobibamus'
+SPECIES='cyprinoides'
+STRAIN='CL'
+THREADS=40
+export GENEMARK_PATH=/scratch2/software/gmes_linux_64-aug-2020/
+# run funannotate
+funannotate predict \
+    --input ${GENOME/fasta/mask.fasta} \
+    --out funannotate_out \
+    --species "$GENUS $SPECIES" \
+    --strain $STRAIN \
+    --cpus $THREADS
+</code>
+NOTE: If you are running ''funannotate predict'' outside of the ''funannotate'' conda environment on Perun (for example within a separate containerized environment as part of a Snakemake workflow), it may complain that the funannotate database is not properly configured. The database is actually already available and properly configured at ''/scratch4/db/funannotate'', but your particular funannotate installation may not know about it. Run ''export FUNANNOTATE_DB="/scratch4/db/funannotate"'' before executing funannotate and it should work.
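+The ''export'' from the note above can be wrapped in a small guard, so that the job stops immediately when the database directory is missing. This is a sketch under the assumption that a plain directory check is enough; ''use_funannotate_db'' is a hypothetical function name, not part of funannotate.

```shell
#!/bin/bash
# Hypothetical guard: keep FUNANNOTATE_DB if it is already set, otherwise
# fall back to the given default, then check that the directory exists.
use_funannotate_db() {
    local default_dir="$1"
    export FUNANNOTATE_DB="${FUNANNOTATE_DB:-$default_dir}"
    if [ ! -d "$FUNANNOTATE_DB" ]; then
        echo "FUNANNOTATE_DB is not a directory: $FUNANNOTATE_DB" >&2
        return 1
    fi
}
```

+On Perun this would be called as ''use_funannotate_db /scratch4/db/funannotate || exit 1'' before ''funannotate predict''.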
+To verify the versions of the databases:
+<code>
+funannotate database
+</code>
+If for some reason you need to re-install the databases from scratch, you can do so with:
+<code>
+funannotate setup -d <your_dir>
+</code>
+And if you do this on a shared system, you might receive this error:
+<code>
+urllib.error.HTTPError: HTTP Error 403: Forbidden
+</code>
+This is a known issue: GO, and possibly other database hosts, reject requests coming through institutional proxies as "automated scraping".
+The fix is to make the requests appear to come from a regular browser, by making the following modifications to:
+''.conda/envs/funannotate/lib/python3.xx/site-packages/funannotate/setupDB.py''
+<code>
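+# import Request alongside urlopen
+from urllib.request import urlopen, Request
+...
+# where the URL is opened: send a browser-like User-Agent header
+req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
+u = urlopen(req)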
+</code>
+When editing the Python file, make sure to indent with spaces, not tabs.
+Many of the required inputs do not have to be explicitly specified, since they have been generated in the previous ''funannotate train'' step and they are available in the ''funannotate_out'' directory.
+Funannotate (according to log files) uses **AUGUSTUS**, **CodingQuarry**, **GeneMark-ES**, **GlimmerHMM** and **SNAP** as main gene model predictors. Note that this is more than used by Braker2, which mainly uses AUGUSTUS and GeneMark. This may be a reason why the Funannotate prediction pipeline seems to do better than Braker2 for //Ergobibamus//.
+The predictors are trained from the sources listed below. Apart from GeneMark-ES, which trains itself, all of these have been generated in the previous step.
+^ Algorithm    ^ Training-Method                                              ^
+| AUGUSTUS     | PASA transcript-to-genome alignments                         |
+| CodingQuarry | RNAseq-vs-genome sorted BAM file and/or StringTie alignments |
+| GeneMark-ES  | self-training                                                |
+| GlimmerHMM   | PASA transcript-to-genome alignments                         |
+| SNAP         | PASA transcript-to-genome alignments                         |
+As with the Braker2 pipeline, funannotate uses **EVidenceModeler (EVM)** to compile the best overall gene models from the above set of gene predictions. For //Ergobibamus// the weights looked something like this:
+<code>
+  Source         Weight   Count
+  Augustus       1        4462
+  Augustus HiQ   2        660
+  CodingQuarry   2        6325
+  GeneMark       1        5433
+  GlimmerHMM     1        5830
+  pasa           6        6470
+  snap           1        3924
+  Total          -        33104
+</code>
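+The Total row in this table is simply the sum of the Count column. A small sketch to check that arithmetic, with the numbers copied from the table above (the variable names are my own, and ''Augustus HiQ'' is written with an underscore so that each row has three fields):

```shell
#!/bin/bash
# Sum the Count column of the weights table (all rows except Total).
weights_table='Augustus 1 4462
Augustus_HiQ 2 660
CodingQuarry 2 6325
GeneMark 1 5433
GlimmerHMM 1 5830
pasa 6 6470
snap 1 3924'

# $NF is the last field on each row, i.e. the Count column.
total=$(printf '%s\n' "$weights_table" | awk '{ sum += $NF } END { print sum }')
echo "$total"   # 33104
```

+Note that PASA contributes a similar number of models as the ab initio predictors but carries the highest weight (6), so EVM will favour the transcript-based models when predictors disagree.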
+In some of the final steps, funannotate calls upon ''tRNAscan'' (I'm guessing the SE version) to predict tRNA genes. Then at the very end, if you have specified it to, it can create GenBank format-compliant files for you to submit to GenBank. Awesome!
+If you intend to curate the gene predictions, it is most practical to do this after the ''funannotate update'' step (see below).
+==== Update ====
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+#$ -cwd
+#$ -m bea
+#$ -M joran.martijn@dal.ca
+#$ -pe threaded 40
+source activate funannotate
+# input
+FUNDIR='funannotate_out'
+THREADS=40
+LOCUS_TAG='PYV62'
+SBT='template.sbt'
+ACCESSION='JARDW000000000'
+MEMORY='240G'
+# run funannotate
+funannotate update \
+    --input $FUNDIR \
+    --cpus $THREADS \
+    --name $LOCUS_TAG \
+    --sbt $SBT \
+    --SeqCenter RogerLab \
+    --SeqAccession $ACCESSION \
+    --no_trimmomatic \
+    --memory $MEMORY
+</code>
+In brief, ''funannotate update'' uses the RNAseq alignments (already present in the ''funannotate_out'' directory) to infer the start and end locations of the 5' and 3' untranslated regions (UTRs) of each gene feature predicted by ''funannotate predict''.
+It will also attempt to fix certain gene models if they are in strong disagreement with the RNAseq data.
+You can also optionally provide the **locus tag**, an **SBT file** and a **WGS accession number** to make your final files ready for GenBank submission. To get these, you need to start a genome submission at the [[https://submit.ncbi.nlm.nih.gov/subs/genome/|Genome Submission Portal]]. You'll have to make an account with NCBI to do this. Perhaps the most straightforward way is to create an account using your ORCID. If you don't have an ORCID (which stands for Open Researcher and Contributor ID) yet, you can create one [[https://orcid.org/|here]]. More and more journals, as well as bioRxiv and other institutions, are making use of ORCID, so it is probably a good idea to get one anyway.
+When you are in the Submission Portal, you can start a new submission. You will be asked to fill in several forms, and perhaps create a new BioProject and/or BioSample along the way. It's kind of annoying, but if you intend to publish your genome you'll need to put it on GenBank and this is the most straightforward way to do it.
+Shortly after your initial submission you'll receive a **locus tag** and a **WGS accession number**. You can create the **SBT file** by going to //Templates// on the top right of the submission portal website (in between //Manage data// and //My profile//).
+Now you should be able to run funannotate update!