functional_annotation_with_the_funannotate_pipeline

Differences

This shows you the differences between two versions of the page.

functional_annotation_with_the_funannotate_pipeline [2023/04/07 12:46] – created 134.190.232.186
functional_annotation_with_the_funannotate_pipeline [2025/12/09 13:02] (current) – [EggNOG mapping] 134.190.190.181
Line 1: Line 1:
 ====== Functional annotation with the Funannotate pipeline ======
  
-Joran Martijn (April 2023)
+Joran Martijn (April 2023) modified December 2024 by Kathy Dunn
  
 Funannotate is a gene prediction, functional annotation, and genome comparison software package. It was originally written to annotate fungal genomes (small eukaryotic genomes of ~30 Mb), but has evolved over time to accommodate larger genomes.
Line 12: Line 12:
  
 In my experience it was easiest to run a separate InterProScan search, EggNOG mapping and SignalP prediction, and then point to each of the resulting output files in the final ''funannotate annotate'' step.
 +
 +**Important note!!!**
 +
 +Before proceeding, you need to check your gff3 file for errors by running ''validate_gene_models_in_gff3.py''. (Activate the gffutils environment to run this script; you may also have to comment out the ''import regex as re'' line if you're not checking Blastocystis genomes.)
 +
 +This script will locate errors in your gff3 file that often occur due to manual editing (premature stop codons, incorrect exon numbering, missing start and stop codons, start or stop location not matching exon location, etc.). An incorrect phase designation for an exon can lead to premature stop codons, so keep that in mind when looking for causes of premature stops.
 +
 +
 +==== Pre-step to prepare files ====
 +
 +For the results from InterProScan, EggNOG mapper and SignalP to be integrated into the ''funannotate annotate'' output, you first need to prepare the gff3 file and the protein data using the two funannotate scripts below:
 +
 +
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +#$ -cwd
 +
 +source activate funannotate
 +
 +funannotate util gff-rename \
 +            --gff3 {current.gff3} \
 +            --fasta {genome_file_masked.fasta} \
 +            --locus_tag {NCBI assigned locus tag or similar} \
 +            --out {renamed.gff3}
 +            
 +            
 +funannotate util gff2prot \
 +            --gff3 {renamed.gff3} \
 +            --fasta {genome_file_masked.fasta} \
 +            --no_stop \
 +            > {protein_file.faa}
 +</code>           
 +
 +You will use the ''protein_file.faa'' generated here as the input for the InterProScan (''--input''), EggNOG mapping (''-i'') and SignalP (''--fastafile'') scripts below, and the ''renamed.gff3'' file in the ''funannotate annotate'' script (see the special note in that section).
 +
 +If you skip the above steps your results from InterProScan, EggNOG mapping and SignalP will not appear in the final output from funannotate annotate!!
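Before handing the protein file to the three tools, a quick sanity check can help. The sketch below assumes that ''--no_stop'' is meant to leave no ''*'' stop symbols in the sequences; the function name ''count_seqs_with_stop'' is hypothetical and not part of funannotate.

```shell
#!/bin/bash
# Hypothetical helper (not part of funannotate): count the FASTA records
# whose sequence still contains a '*' stop symbol.
count_seqs_with_stop() {
    awk '
        /^>/ { if (has_stop) n++; has_stop = 0; next }  # new record: tally the previous one
        /\*/ { has_stop = 1 }                           # sequence line with a stop symbol
        END  { if (has_stop) n++; print n + 0 }         # tally the last record, print the count
    ' "$1"
}

# Usage (file name taken from the placeholder above):
# count_seqs_with_stop protein_file.faa
```

A result of 0 suggests the file is free of stop symbols; any other number points at records worth inspecting before running the downstream searches.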
  
 ==== InterProScan ====
 +
 +InterProScan will check which InterPro and/or Pfam domains are present in each of your proteins.
 +
 +NOTE: if your genome required edits, such that your gene models do not come straight from a funannotate run, you can substitute the funannotate folder (''--input test_funannotate_out'') below with the protein FASTA file generated above (''--input protein_file.faa'').
 +
 +Funannotate includes a wrapper for executing the ''interproscan.sh'' script:
  
 <code>
Line 41: Line 85:
 </code>
  
 +NOTE: Your ''--input'' can be either a pre-existing funannotate output directory (for instance if you run this step after you already ran another funannotate command earlier) or a plain protein FASTA file.
 +
 +NOTE: When you specify your ''$OUTXML'' in the bash script above, make sure that any directory defined in the path already exists. ''funannotate iprscan'' will not automatically generate output directories if they don't exist yet and will not finish properly.
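One way to guard against the missing-directory problem is to create the parent directory of the output path before calling ''funannotate iprscan''. This is a minimal sketch; the helper name ''ensure_parent_dir'' is hypothetical, and ''OUTXML'' is assumed to be the variable defined in the script above.

```shell
#!/bin/bash
# Hypothetical helper: create the parent directory of an output path
# if it does not exist yet (mkdir -p is a no-op when it already exists).
ensure_parent_dir() {
    mkdir -p "$(dirname "$1")"
}

# Usage, placed before the funannotate iprscan call:
# ensure_parent_dir "$OUTXML"
```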
 ==== EggNOG mapping ====
 +
 +The eggNOG mapping algorithm (''emapper.py'') aims to determine which orthologous group (aka gene family) of the eggNOG database each of your predicted proteins belongs to.
  
 <code>
Line 48: Line 97:
 #$ -cwd
 #$ -q 256G-batch
-#$ -m bea 
-#$ -M joran.martijn@dal.ca 
 #$ -pe threaded 40
  
Line 77: Line 124:
  
 Instead of ''-m diamond'' you may want to consider using ''-m hmmer'', which should be a bit more sensitive.
 +
 +==== SignalP ====
 +
 +The code below runs SignalP v6.0, which uses sophisticated methods to find proteins that have N-terminal signal peptides and are thus destined for the endoplasmic reticulum or for secretion outside the cell.
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +#$ -cwd
 +#$ -q 256G-batch
 +#$ -m bea
 +#$ -M joran.martijn@dal.ca
 +#$ -pe threaded 8
 +
 +source activate signalp6-fast
 +
 +PROTEINS='Ergobibamus_cyprinoides_CL.proteins.fa'
 +OUTDIR='signalp_out'
 +THREADS=8
 +
 +# run signalp6
 +signalp6 \
 +    --fastafile $PROTEINS \
 +    --output_dir $OUTDIR \
 +    --format all \
 +    --organism eukarya \
 +    --mode fast \
 +    --write_procs $THREADS
 +</code>
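Once SignalP has finished, a quick tally of the predictions can confirm that the run produced sensible output. The sketch below rests on an assumption about the layout of ''prediction_results.txt'': tab-separated columns, header lines starting with ''#'', and a second column that reads ''OTHER'' for proteins without a signal peptide. Check this against your own file before relying on it. The function name ''count_signal_peptides'' is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count proteins whose prediction is anything but OTHER.
# Assumed layout: tab-separated, '#' header lines, prediction in column 2.
count_signal_peptides() {
    awk -F '\t' '
        /^#/ { next }                      # skip header/comment lines
        NF >= 2 && $2 != "OTHER" { n++ }   # any other label counts as a signal peptide
        END { print n + 0 }
    ' "$1"
}

# Usage (path built from the variables in the script above):
# count_signal_peptides "$OUTDIR/prediction_results.txt"
```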
 +
 +==== Funannotate annotate ====
 +
 +Finally, we can integrate the analyses above with the other annotation steps using a single ''funannotate annotate'' command.
 +
 +If you have used funannotate to predict your genes, you should have a funannotate directory where all related output files are stored. The first script below passes it via ''--input''.
 +
 +Special note: if your gene models were not generated solely by funannotate, supply the gff3 (''--gff'') and genome FASTA (''--fasta'') files in place of ''--input'' rather than pointing at the funannotate directory, as in the second example script.
 +
 +
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +#$ -cwd
 +#$ -q 256G-batch
 +#$ -m bea
 +#$ -M joran.martijn@dal.ca
 +#$ -pe threaded 40
 +
 +source activate funannotate
 +
 +FUN_DIR='funannotate_out'
 +
 +## --eggnog asks for the '.annotations' file
 +EGGNOG_RESULTS='emapper_out/ergo_emapper.emapper.annotations'
 +
 +SIGNALP_RESULTS='results/35_signalp-6.0g/prediction_results.txt'
 +IPRSCAN_RESULTS='results/36_interproscan-5.52-86.0/iprscan.xml'
 +SBT_FILE='misc/template.sbt' # necessary for NCBI GenBank submission
 +
 +THREADS=40
 +
 +# run funannotate
 +
 +funannotate annotate \
 +    --input $FUN_DIR \
 +    --eggnog $EGGNOG_RESULTS \
 +    --signalp $SIGNALP_RESULTS \
 +    --iprscan $IPRSCAN_RESULTS \
 +    --busco_db eukaryota \
 +    --cpus $THREADS \
 +    --sbt $SBT_FILE
 +
 +</code>
 +
 +If your gene models have not been called solely by funannotate, you can use your gff3 and genome FASTA files instead:
 +
 +<code>
 +#!/bin/bash
 +#$ -S /bin/bash
 +#$ -cwd
 +#$ -q 256G-batch
 +#$ -pe threaded 40
 +
 +source activate funannotate
 +
 +## --eggnog asks for the '.annotations' file
 +EGGNOG_RESULTS='functional_annotation/emapper_out/BlastoST2_emapper.emapper.annotations'
 +SIGNALP_RESULTS='functional_annotation/signalp_out/prediction_results.txt'
 +IPRSCAN_RESULTS='functional_annotation/interpro_results/iprscan_results.xml'
 +THREADS=40
 +
 +funannotate annotate \
 +    --gff renamed.gff3 \
 +    --fasta genome_masked.fasta \
 +    --species Blastocystis_ST2 \
 +    --out functional_annotation \
 +    --eggnog $EGGNOG_RESULTS \
 +    --signalp $SIGNALP_RESULTS \
 +    --iprscan $IPRSCAN_RESULTS \
 +    --busco_db eukaryota \
 +    --cpus $THREADS 
 +</code>
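Because ''funannotate annotate'' depends on three result files produced by earlier jobs, it can be worth confirming that they exist and are non-empty before submitting the final job. The following is a minimal sketch; the function name ''check_inputs'' is hypothetical, and the variable names in the usage line are those from the scripts above.

```shell
#!/bin/bash
# Hypothetical pre-flight check: report every path that is missing or empty.
# Returns 0 if all files are present and non-empty, 1 otherwise.
check_inputs() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then             # -s: file exists and has size > 0
            echo "Missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, placed before the funannotate annotate call:
# check_inputs "$EGGNOG_RESULTS" "$SIGNALP_RESULTS" "$IPRSCAN_RESULTS" || exit 1
```

All problem files are reported in one pass rather than stopping at the first, which avoids resubmitting the job several times.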
 +