Background: Trinotate is a comprehensive annotation suite designed for automatic functional annotation of de novo transcriptome assemblies created with the Trinity assembly program, from either novel or model organisms. Trinotate uses several well-referenced methods for functional annotation, including homology search against known sequence data (NCBI-BLAST), protein domain identification (HMMER/PFAM), signal peptide and transmembrane region prediction (signalP/tmHMM), and comparison to currently curated annotation databases (EMBL Uniprot eggNOG/GO Pathways databases). All functional annotation data derived from the analysis of a Trinity transcriptome assembly is integrated into an SQLite database, which allows fast, efficient searching for terms with specific qualities related to a desired scientific hypothesis, and provides a means to create a whole annotation report for a transcriptome.

Trinotate is included in the Trinity package under $TRINITY_HOME/Analysis/FunctionalAnnotation/.

1. Table of Contents

Software and Data Required
Sequence Databases Required
Running Sequence Analyses
Trinotate: Loading results into an SQLite database
Trinotate: Output an annotation report
Trinotate: Sample data and execution
Literature references for software leveraged for functional annotation

Software and Data Required

1. Software Required

Trinity (Trinotate is bundled with the distribution): http://trinityrnaseq.sourceforge.net/

sqlite (required for database integration): http://www.sqlite.org/

NCBI Blast: Blast database Homology Search: http://www.ncbi.nlm.nih.gov/books/NBK52640/

HMMER/PFAM Protein Domain Identification: http://hmmer.janelia.org/software

signalP v4 (free academic download) http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp

You should edit the following line in the 'signalp' script so that it reads as shown below, increasing the maximum number of entries that can be processed:
my $MAX_ALLOWED_ENTRIES=2000000;  # default is only 2000

tmhmm v2 (free academic download) http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm

You might need to edit the header lines of the scripts 'tmhmm' and 'tmhmmformat.pl' to read:
#!/usr/bin/env perl
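
Before starting, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, not part of Trinotate: the function name missing_tools and the REQUIRED_TOOLS list are hypothetical, and the tool names are simply the commands that appear elsewhere in this document.

```shell
# Print the name of every listed tool that is not found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Command names as they appear in this document (an assumption about
# how the tools are installed on your system).
REQUIRED_TOOLS="makeblastdb blastp hmmpress hmmscan signalp tmhmm"

# Intentionally unquoted so the list is split into separate arguments.
missing_tools $REQUIRED_TOOLS || echo "Install the tools listed above before continuing." >&2
```

Any names printed are the tools that still need to be installed or added to PATH.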

2. Sequence Databases Required

SwissProt ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Be sure the search database is properly formatted by uncompressing the downloaded file and running the following (requires that BLAST+ is already installed, as indicated above):
makeblastdb -in uniprot_sprot.fasta -dbtype prot

Pfam domains ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz

If you don't have the Pfam database installed, be sure to download it, uncompress it, and prepare it for use with 'hmmscan' like so:
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
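
To check whether the two databases have already been prepared, one option is to look for the index files that the formatting tools write next to their input. This sketch assumes the usual protein index extensions from makeblastdb (.phr, .pin, .psq) and from hmmpress (.h3f, .h3i, .h3m, .h3p). The function names are hypothetical, and very large BLAST databases that are split into volumes would need a different check.

```shell
# Succeed only if <base>.<ext> exists and is non-empty for every listed extension.
have_index_files() {
    base=$1
    shift
    for ext in "$@"; do
        [ -s "$base.$ext" ] || return 1
    done
    return 0
}

# Protein BLAST database built by: makeblastdb -in <fasta> -dbtype prot
blast_db_ready() { have_index_files "$1" phr pin psq; }

# Pfam HMM library prepared by: hmmpress <hmmfile>
pfam_db_ready() { have_index_files "$1" h3f h3i h3m h3p; }

# Example use (left commented out so nothing runs unintentionally):
# blast_db_ready uniprot_sprot.fasta || makeblastdb -in uniprot_sprot.fasta -dbtype prot
# pfam_db_ready Pfam-A.hmm || hmmpress Pfam-A.hmm
```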

Running Sequence Analyses

1. Trinity files needed for execution

Trinity.fasta: Final product containing all the transcripts assembled by Trinity

Trinity.fasta.transdecoder.pep: Most likely longest-ORF peptide candidates generated from the Trinity assembly. Instructions for generating this file can be found here: http://trinityrnaseq.sourceforge.net/analysis/extract_proteins_from_trinity_transcripts.html

Note

Transdecoder is included in Trinity at $TRINITY_HOME/trinity-plugins/transdecoder/. Newer versions of Transdecoder generate a file named Trinity.fasta.transdecoder.pep; earlier versions generate a file named best_candidates.eclipsed_orfs_removed.pep. The two should be treated as equivalent outputs.
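
Because the peptide file can carry either of the two names mentioned in the note, a script can simply pick whichever one is present. This is a small sketch under that assumption; find_pep_file is a hypothetical helper, not part of Trinity or Trinotate.

```shell
# Print the path of the first peptide file that exists and is non-empty
# in the given directory (default: current directory). The newer file
# name is tried first. Returns non-zero if neither file is found.
find_pep_file() {
    dir=${1:-.}
    for name in Trinity.fasta.transdecoder.pep best_candidates.eclipsed_orfs_removed.pep; do
        if [ -s "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Example use (commented out):
# PEP=$(find_pep_file) || { echo "no peptide file found" >&2; exit 1; }
```

The resulting path can then be passed to the sequence analysis commands in place of a hard-coded file name.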

2. Capturing BLAST Homologies

Instructions for installing the command-line standalone BLAST can be found here: http://www.ncbi.nlm.nih.gov/books/NBK52640/ NOTE: This step will likely take the longest. For very large files, running on a multi-CPU server or in an HPC environment is highly recommended, with the thread count set to the number of CPUs present on the node where the job runs.

Blast Commands

blastp -query Trinity.fasta.transdecoder.pep -db uniprot_sprot.fasta -num_threads 8 -max_target_seqs 1 -outfmt 6 > TrinotateBlast.out

Note: -num_threads should be equal to the number of cores your computer or computational node has.
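
Once BLAST finishes, a quick sanity check is to compare the number of peptides that received a hit with the number of peptides searched. The sketch below relies on the tabular format (-outfmt 6) being tab-delimited with the query ID in the first column. Both function names are hypothetical.

```shell
# Count distinct query IDs (column 1) in a BLAST tabular (-outfmt 6) file.
count_queries_with_hits() {
    cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}

# Count the sequences in a FASTA file (header lines start with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example use (commented out):
# echo "$(count_queries_with_hits TrinotateBlast.out) of $(count_fasta_records Trinity.fasta.transdecoder.pep) peptides have a hit"
```

A query can appear on more than one line of the tabular output, which is why distinct IDs are counted rather than lines.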

3. Running HMMER to identify protein domains

hmmscan (HMMER) command:

hmmscan --cpu 8 --domtblout TrinotatePFAM.out Pfam-A.hmm Trinity.fasta.transdecoder.pep > pfam.log

Note: --cpu should be equal to the number of cores your computer or computational node has.
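
Rather than hard-coding 8, the thread count for the BLAST and HMMER steps can be detected at run time. This is a minimal sketch that assumes getconf _NPROCESSORS_ONLN is available (it is on typical Linux and macOS systems) and falls back to a single thread otherwise. detect_threads is a hypothetical helper name.

```shell
# Print the number of online CPU cores, or 1 if it cannot be determined.
detect_threads() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric: fall back to 1
    esac
    echo "$n"
}

THREADS=$(detect_threads)
echo "Using $THREADS threads"

# The value can then replace the literal 8 in the commands above, e.g.:
# blastp ... -num_threads "$THREADS" ...
# hmmscan --cpu "$THREADS" ...
```

On a shared cluster node, the scheduler's allocation may be smaller than the number of cores detected, so the allocated count should take precedence there.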

4. Running signalP to predict signal peptides

signalP command:

signalp -f short -n signalp.out Trinity.fasta.transdecoder.pep

5. Running tmHMM to predict transmembrane regions

tmhmm command:

tmhmm --short < Trinity.fasta.transdecoder.pep > tmhmm.out

Trinotate: Loading Above Results into a Trinotate SQLite Database

The following commands will import the results from the bioinformatic computes performed above into a Trinotate SQLite database, which is used for generating the final summary output file, and which we expect to support some additional downstream applications under development.

Note, the Trinotate.pl script can be found in the Trinity software package as:

$TRINITY_HOME/Analysis/FunctionalAnnotation/Trinotate.pl

usage: /Users/bhaas/SVN/trinityrnaseq/trunk/Analysis/FunctionalAnnotation/Trinotate.pl <command> <input> [...]

<command>:  LOAD_transdecoder
            LOAD_blast
            LOAD_pfam
            LOAD_tmhmm
            LOAD_signalp
            report

1. Retrieve the Trinotate Pre-generated Resource SQLite database

A pregenerated SQLite database that contains SwissProt-related annotation information is available from the Trinity download site on SourceForge. Pull it down like so:

wget -O TrinityFunctional.swissprot.2012-02-13.db.gz "http://sourceforge.net/projects/trinityrnaseq/files/misc/TRINOTATE_RESOURCES/TrinityFunctional.swissprot.2012-02-13.db.gz/download"

Uncompress it and rename it as 'TrinityFunctional.db':

gunzip TrinityFunctional.swissprot.2012-02-13.db.gz

mv TrinityFunctional.swissprot.2012-02-13.db TrinityFunctional.db

2. Load the Transdecoder Trinity Peptide Predictions

Trinotate.pl LOAD_transdecoder Trinity.fasta.transdecoder.pep

3. Load BLAST homologies

Trinotate.pl LOAD_blast TrinotateBlast.out

4. Load Pfam domain entries

Trinotate.pl LOAD_pfam TrinotatePFAM.out

5. Load transmembrane domains

Trinotate.pl LOAD_tmhmm tmhmm.out

6. Load signal peptide predictions

Trinotate.pl LOAD_signalp signalp.out
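
The five load steps above can be wrapped in one function that stops at the first missing input or failed load. This is only a sketch: load_all_results is a hypothetical helper, the file names are the ones used in this document, and the path to Trinotate.pl is passed in as an argument rather than assumed.

```shell
# Run every LOAD_* step in the order given above.
# $1 is the Trinotate.pl command to invoke.
# Returns non-zero at the first missing/empty input or failed load.
load_all_results() {
    trinotate=$1
    while read -r cmd file; do
        if [ ! -s "$file" ]; then
            echo "missing or empty input for $cmd: $file" >&2
            return 1
        fi
        # stdin is redirected so the command cannot consume the step list
        "$trinotate" "$cmd" "$file" </dev/null || return 1
    done <<EOF
LOAD_transdecoder Trinity.fasta.transdecoder.pep
LOAD_blast TrinotateBlast.out
LOAD_pfam TrinotatePFAM.out
LOAD_tmhmm tmhmm.out
LOAD_signalp signalp.out
EOF
}

# Example use (commented out):
# load_all_results "$TRINITY_HOME/Analysis/FunctionalAnnotation/Trinotate.pl"
```

Checking for empty files first avoids loading the output of an analysis step that failed silently.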

Trinotate: Output an Annotation Report

Trinotate.pl report [opts] > trinotate_annotation_report.xls

Note, you can threshold the blast and pfam results to be reported by including the options below:

##################################################################
#
#  -E <float>                 maximum E-value for reporting best blast hit
#                             and associated annotations.
#
#  --pfam_cutoff <string>     'DNC' : domain noise cutoff (default)
#                             'DGC' : domain gathering cutoff
#                             'DTC' : domain trusted cutoff
#                             'SNC' : sequence noise cutoff
#                             'SGC' : sequence gathering cutoff
#                             'STC' : sequence trusted cutoff
#
##################################################################

The output has the following column headers:

0   #component
1   trans_derived
2   prot_id
3   TopBlastHit
4   Pfam
5   SignalP
6   TmHMM
7   eggnog
8   gene_ontology
9   prot_seq

and the data are formatted like so:

# example 1

0   comp2700_c0
1   comp2700_c0_seq1:2-4099(+)
2   m.1814
3   sp|Q8E5V5|GLGA_STRA3`Q8E5V5`Q:270-720,H:15-474`25.26%ID`E:2e-17`RecName: Full=Glycogen synthase; EC=2.4.1.21; AltName: Full=Starch [bacterial glycogen] synthase;`Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus.
4   PF08323.6^Glyco_transf_5^Starch synthase catalytic domain^253-458^E:9.5e-18`PF00534.15^Glycos_transf_1^Glycosyl transferases group 1^537-659^E:1.7e-09
5   sigP:1`18`0.816`YES
6   ExpAA=279.46`PredHel=13`Topology=o172-194i932-954o969-991i996-1018o1028-1050i1062-1084o1104-1126i1152-1174o1189-1208i1221-1243o1258-1280i1287-1309o1335-1357i
7   COG0297^ Glycogen synthase
8   GO:0009011^molecular_function^starch synthase activity`GO:0005978^biological_process^glycogen biosynthetic process
9   KKSFLFQFLFGWIVLSSAQWLSVLDEAENLNSSFSLESVDSFAPVRPRFIIDEDFAEDYNLTVDILHRPLQENFDSFFPNVEAYVESGNSNGDLMSDNGKLDDLNSRAAYSALKALQNSYGSSHLYRFTPYELFGQSIWIEEAPEVNHVGWSIMFDNLGRYFLLELRGLREVTFALFITFSIVPIITGILTVYIYKKKYCVIKFNKSGRSKKKDSWLKRSKDELLRTDSANLLTLNDNDEPVMIRHSCKRTCILFATLEYNIPEWNIKIKIGGLGVMAELMSKTLKQYDLVWVVPCVGDITYPVAETAPSLVVKVVNQDYEVKVFYHYKDNIKYVLLDSPIFRKRTSHDPYPPRMDDISSAIFYSVWNQSIAAIIRREKVDIYHINDYHGALAPIYNLPEVIPCAISLHNAEFQGLWPLQSSIDEREVCGLFNVSKTICREYIQFGNAFNLMHCGVSYVRRHQSGYGVVGVSNKYGQRSWVRYPVFWSLKKIGQLPNPDPTDIGLSVNPVNQQLPDFAEYASVRKENKRKAQEWAGLTIDDEADLLVFVGRWSVQKGIDILADLAPTLLEKFNIQLIVVGPLIDLYGKFAAEKFMYIMERYPGRVFSKPEFVHLPPFIFEGADFALIPSRDEPFGLVAVEFGRRGAICIGSRVGGLGEMPGWWYSVESSSTAYLLKQLEKSCTLALKSTPEMRHKLRIAALQQRFPVDEWVALYDRLIRNCIKAHNKQQQRRSIKSFFSCITPNKPTKDVNDILSEKSAFSPADYEHSIDIREHTSYDANSMDNDSDEDNYEQAESIISSLSSSALSELSYISESSMNIGSRLDERFIDANGVAIRDFSAELTYLTPENSKGKLSIDHFLNKVQSRWHDEEHHYYKTGFRKRVYKYLKIKDKKSKDVDPDDDLVNQLPLNAYTKPRYKSAASTRLNIYQRILYLKVFTWPLYTIFLSLGQILSISSFQLSLLSGFEDNNQISLYVITGVFILSTIVWWGLYRNLPSVHSLSLPFAVYALAFLFTGISSMSLPYHIRGWLSYAATWVYAIAAASGPLYFTLNFEDEHCSGLGSSITRACVLQGVQQLWLSFLWLWGTLSSRLDYNYKVLLQPINSVYVVAGVWPVSFVLLSVCILLYKGLPPFYRQKPGSIPAFSKSLLHRKVVICFLISVINQNFWMSTLISQAWRFFWGSKLTKLWKIVVMTVSFLVGAWLIIFYVLRKLSNKHTWMVPVLGLGFGAIKWMHVFWGTSNVGIFLPWAGIAGPYLSRALWLWLGILDSIQGIGNGLILLQTLSRRHVTNTLMISQLAGSATSILARFVSPTKTGPANVFPDLTGYTPVDRAKPVANAPFWICLILNVALCIMYLRCYHRENISRP*

# example 2

0   comp1507_c0
1   comp1507_c0_seq1:405-1415(+)
2   m.772
3   sp|Q7Z8R5|PALI_YARLI`Q7Z8R5`Q:1-236,H:3-234`30.13%ID`E:3e-21`RecName: Full=pH-response regulator protein palI/RIM9;`Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Dipodascaceae; Yarrowia.
4   PF06687.7^SUR7^SUR7/PalI family^7-165^E:2.4e-18
5   sigP:1`23`0.564`YES
6   ExpAA=91.29`PredHel=4`Topology=i9-31o89-111i118-140o150-172i
7   NOG12793^ Calcium ion binding protein
8   GO:0016021^cellular_component^integral to membrane`GO:0005886^cellular_component^plasma membrane
9   MRIRSATPSLILLVIAIVFFVLAICTPPLANNLTLGKYGDVRFGVFGYCLNSNCSKPLVGYNSDYLDEHAKDGFRTSVIVRQRASYGLVIVPVSACICLISTIMTIFAHIGAIARSPGFFNVIGTITFFNIFITAIAFVICVITFVPHIQWPSWLVLANVGIQLIVLLLLLVARRQATRLQAKHLRRATSGSLGYNPYSLQNSSNIFSTSSRKGDLPKFSDYSAEKPMYDTISEDDGLKRGGSVSKLKPTFSNDSRSLSSYAPTVREPVPVPKSNSGFRFPFMRNKPAEQAPENPFRDPENPFKDPASAPAPNPWSINDVQANNDKKPSRFSWGRS*

Backticks (`) and carets (^) are used as delimiters for data packed within an individual field, such as separating E-values, percent identity, and taxonomic info for best matches. When there are multiple assignments in a given field, the assignments are separated by backticks (`) and the fields within an assignment are separated by carets (^). In a future release (post Feb-2013), the backticks and carets will be used more uniformly than above, for example with carets as BLAST field separators, and the report will include more than the top hit.
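
The packed fields can be unpacked with standard text tools. The sketch below uses the Pfam field from example 1 and a truncated copy of its TopBlastHit field (description and taxonomy dropped for brevity). The helper names unpack_field and blast_evalue are hypothetical. It reflects the current layout, in which the BLAST sub-fields are separated by backticks.

```shell
# Split a packed field: one assignment per line, sub-fields tab-separated.
unpack_field() {
    printf '%s\n' "$1" | tr '`' '\n' | tr '^' '\t'
}

# Extract the E-value from a TopBlastHit field (the sub-field starting with "E:").
blast_evalue() {
    printf '%s\n' "$1" | tr '`' '\n' | sed -n 's/^E://p'
}

# Pfam field copied from example 1 above.
pfam='PF08323.6^Glyco_transf_5^Starch synthase catalytic domain^253-458^E:9.5e-18`PF00534.15^Glycos_transf_1^Glycosyl transferases group 1^537-659^E:1.7e-09'

# TopBlastHit field from example 1, truncated after the E-value.
hit='sp|Q8E5V5|GLGA_STRA3`Q8E5V5`Q:270-720,H:15-474`25.26%ID`E:2e-17'

# Print the Pfam accession and domain name of each assignment.
unpack_field "$pfam" | cut -f1,2

# Print the E-value of the top BLAST hit.
blast_evalue "$hit"
```

If a future release changes the delimiters as described, the two tr calls are the only places that would need adjusting.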

Trinotate: Sample data and execution

Sample data and a runMe.sh script are available at $TRINITY_HOME/Analysis/FunctionalAnnotation/sample_data

Executing the runMe.sh script will pull down the Trinotate SQLite database, populate it with the provided bioinformatics computes, and generate the final Trinotate annotation report.

Literature references for software used for functional annotation

[Trinity] Full-length transcriptome assembly from RNA-Seq data without a reference genome. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Nature Biotechnology 29, 644-652 (2011)
[HMMER] HMMER web server: interactive sequence similarity searching. R.D. Finn, J. Clements, S.R. Eddy. Nucleic Acids Research (2011) Web Server Issue 39:W29-W37.
[PFAM] The Pfam protein families database. M. Punta, P.C. Coggill, R.Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E.L.L. Sonnhammer, S.R. Eddy, A. Bateman, R.D. Finn. Nucleic Acids Research (2012) Database Issue 40:D290-D301
[SignalP] SignalP 4.0: discriminating signal peptides from transmembrane regions. Thomas Nordahl Petersen, Soren Brunak, Gunnar von Heijne & Henrik Nielsen. Nature Methods, 8:785-786, 2011
[tmHMM] Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. J Mol Biol. 2001 Jan 19;305(3):567-80.
[BLAST] Basic local alignment search tool. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. J Mol Biol 215: 403-10 (1990)
[KEGG] KEGG for integration and interpretation of large-scale molecular datasets. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., and Tanabe, M. Nucleic Acids Res. 40, D109-D114 (2012).
[GO] Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25: 25-29 (2000)
[eggNOG] eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P. Nucleic Acids Res. 2012 Jan;40(Database issue):D284-9. Epub 2011 Nov 16.

Trinotate: Trinity Transcriptome Functional Annotation