
Background: Trinotate is a comprehensive annotation suite designed for automatic functional annotation of de novo transcriptome assemblies, from novel or model organisms, created using the Trinity assembly program. Trinotate uses a number of well-referenced methods for functional annotation, including homology search against known sequence data (NCBI BLAST), protein domain identification (HMMER/Pfam), signal peptide and transmembrane region prediction (signalP/tmHMM), and comparison to curated annotation databases (EMBL UniProt, eggNOG, and GO Pathways databases). All functional annotation data derived from the analysis of a Trinity transcriptome assembly are integrated into a SQLite database, which allows fast, efficient searching for terms with specific qualities related to a desired scientific hypothesis, and provides a means to create a whole annotation report for a transcriptome.
Trinotate is included in the Trinity package under $TRINITY_HOME/Analysis/FunctionalAnnotation/.
Software and Data Required
1. Software Required
Trinity (Trinotate is bundled with the distribution): http://trinityrnaseq.sourceforge.net/
sqlite (required for database integration): http://www.sqlite.org/
NCBI BLAST+ (database homology search): http://www.ncbi.nlm.nih.gov/books/NBK52640/
HMMER/Pfam (protein domain identification): http://hmmer.janelia.org/software
signalP v4 (free academic download): http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp
You should edit the following line so that it reads as shown, which increases the maximum number of entries that can be processed:
my $MAX_ALLOWED_ENTRIES=2000000; # default is only 2000
tmhmm v2 (free academic download): http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm
You might need to edit the header lines of the scripts 'tmhmm' and 'tmhmmformat.pl' to read:
#!/usr/bin/env perl
2. Sequence Databases Required
Be sure the search database is properly formatted by running the following (requires that BLAST+ is already installed, as indicated above):
makeblastdb -in uniprot_trembl.fasta -dbtype prot
If you don't have the Pfam database installed, be sure to download it, uncompress it, and prepare it for use with 'hmmscan' like so:
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
Running Sequence Analyses
1. Trinity files needed for execution
Trinity.fasta - Final product containing all the transcripts assembled by Trinity
Trinity.fasta.transdecoder.pep - Most likely longest-ORF peptide candidates generated from the Trinity assembly. Instructions for generating this file can be found here: http://trinityrnaseq.sourceforge.net/analysis/extract_proteins_from_trinity_transcripts.html
Note: Transdecoder is included in Trinity at $TRINITY_HOME/trinity-plugins/transdecoder/. Newer versions of Transdecoder generate a file named Trinity.fasta.transdecoder.pep; earlier versions generate an equivalent file named best_candidates.eclipsed_orfs_removed.pep. Treat the two as equivalent outputs.
2. Capturing BLAST Homologies

Instructions for installing command-line standalone BLAST can be found here: http://www.ncbi.nlm.nih.gov/books/NBK52640/
NOTE: This step will almost certainly take the longest. For very large files, execution on a multi-CPU server or HPC environment is highly recommended, and the thread count should equal the number of CPUs present on the node the job runs on.
BLAST command:
blastp -query best_candidates.eclipsed_orfs_removed.pep -db SwissProtFormated -num_threads 8 -max_target_seqs 1 -outfmt 6 > TrinotateBlast.out
Note: num_threads should equal the number of cores available on your computer or computational node.
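The thread-count advice can be sketched in Python as follows. This is a minimal illustration, not part of Trinotate: the helper name build_blastp_command is hypothetical, and it assumes that the CPU count reported by the operating system is the number you want. On a shared HPC node the scheduler may allocate fewer cores than the OS reports, in which case pass the thread count explicitly.

```python
import os


def build_blastp_command(query, db, threads=None):
    """Return the blastp argument list shown above.

    Hypothetical helper: when no thread count is given, it defaults to
    the number of CPUs the operating system reports.
    """
    if threads is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        threads = os.cpu_count() or 1
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-num_threads", str(threads),
        "-max_target_seqs", "1",
        "-outfmt", "6",
    ]


cmd = build_blastp_command(
    "best_candidates.eclipsed_orfs_removed.pep", "SwissProtFormated"
)
# Print the equivalent shell command; the redirect is left to the caller.
print(" ".join(cmd) + " > TrinotateBlast.out")
```

The function only builds the argument list. Running it (for example with subprocess.run and an open output file) is left to the caller so that the sketch has no side effects.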
3. Running HMMER to identify protein domains
hmmscan (HMMER) command:
hmmscan --cpu 8 --domtblout TrinotatePFAM.out PfamA26.hmm best_candidates.eclipsed_orfs_removed.pep > pfam.log
Note: the --cpu value should equal the number of cores available on your computer or computational node.
4. Running signalP to predict signal peptides
signalP command:
signalp -f short -n signalp.out Trinity.fasta.transdecoder.pep
5. Running tmHMM to predict transmembrane regions
tmhmm command:
tmhmm --short < best_candidates.pep > tmhmm.out
Trinotate: Loading Above Results into a Trinotate SQLite Database
The following commands import the results of the analyses performed above into a Trinotate SQLite database, which is used to generate the final summary output file and which we expect to support additional downstream applications under development.
Note, the Trinotate.pl script can be found in the Trinity software package as:
$TRINITY_HOME/Analysis/FunctionalAnnotation/Trinotate.pl
usage: /Users/bhaas/SVN/trinityrnaseq/trunk/Analysis/FunctionalAnnotation/Trinotate.pl <command> <input> [...]
<command>: LOAD_transdecoder
LOAD_blast
LOAD_pfam
LOAD_tmhmm
LOAD_signalp
report
1. Retrieve the Trinotate Pre-generated Resource SQLite database
A pre-generated SQLite database that contains Swiss-Prot-related annotation information is available from the Trinity SourceForge site. Pull it down like so:
wget -O TrinityFunctional.swissprot.2012-02-13.db.gz "http://sourceforge.net/projects/trinityrnaseq/files/misc/TRINOTATE_RESOURCES/TrinityFunctional.swissprot.2012-02-13.db.gz/download"
Uncompress it and rename it to 'TrinityFunctional.db':
gunzip TrinityFunctional.swissprot.2012-02-13.db.gz
mv TrinityFunctional.swissprot.2012-02-13.db TrinityFunctional.db
2. Load the Transdecoder Trinity Peptide Predictions
Trinotate.pl LOAD_transdecoder Trinity.fasta.transdecoder.pep
3. Load BLAST homologies
Trinotate.pl LOAD_blast TrinotateBlast.out
4. Load Pfam domain entries
Trinotate.pl LOAD_pfam TrinotatePFAM.out
5. Load transmembrane domains
Trinotate.pl LOAD_tmhmm tmhmm.out
6. Load signal peptide predictions
Trinotate.pl LOAD_signalp signalp.out
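The five load steps can be collected into one small driver. The sketch below is only an illustration in Python: the names LOAD_STEPS, load_commands, and run_all are hypothetical, the input file names are the ones used earlier in this document, and it assumes Trinotate.pl is on your PATH and is run from the directory that holds TrinityFunctional.db. By default it performs a dry run and only prints the commands.

```python
import subprocess

# (Trinotate.pl command, input file) pairs, in the order given in this section.
LOAD_STEPS = [
    ("LOAD_transdecoder", "Trinity.fasta.transdecoder.pep"),
    ("LOAD_blast", "TrinotateBlast.out"),
    ("LOAD_pfam", "TrinotatePFAM.out"),
    ("LOAD_tmhmm", "tmhmm.out"),
    ("LOAD_signalp", "signalp.out"),
]


def load_commands(trinotate="Trinotate.pl"):
    """Return one argument list per load step."""
    return [[trinotate, command, path] for command, path in LOAD_STEPS]


def run_all(trinotate="Trinotate.pl", dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in load_commands(trinotate):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first step that exits with an error.
            subprocess.run(argv, check=True)


run_all()
```

Passing dry_run=False executes the steps in order and stops at the first failure, which avoids loading later results into a database whose earlier step did not complete.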
Trinotate: Output an Annotation Report
Trinotate.pl report [opts] > trinotate_annotation_report.xls
Note, you can threshold the blast and pfam results to be reported by including the options below:
##################################################################
#
#  -E <float>               maximum E-value for reporting best blast hit
#                           and associated annotations.
#
#  --pfam_cutoff <string>   'DNC' : domain noise cutoff (default)
#                           'DGC' : domain gathering cutoff
#                           'DTC' : domain trusted cutoff
#                           'SNC' : sequence noise cutoff
#                           'SGC' : sequence gathering cutoff
#                           'STC' : sequence trusted cutoff
#
##################################################################
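For a report that has already been generated, a comparable E-value threshold can be applied after the fact. The Python sketch below is not how Trinotate.pl implements -E; it assumes the TopBlastHit layout seen in the example records in this document, where sub-fields are separated by backticks and the E-value appears as a sub-field starting with "E:". The function names are hypothetical.

```python
def blast_evalue(top_blast_hit):
    """Return the E-value found in a TopBlastHit field, or None if absent."""
    for part in top_blast_hit.split("`"):
        if part.startswith("E:"):
            return float(part[2:])
    return None


def passes_evalue(top_blast_hit, max_evalue):
    """True when the field carries an E-value no larger than max_evalue."""
    evalue = blast_evalue(top_blast_hit)
    return evalue is not None and evalue <= max_evalue


# TopBlastHit value taken from example 2 in this document.
hit = (
    "sp|Q7Z8R5|PALI_YARLI`Q7Z8R5`Q:1-236,H:3-234`30.13%ID`E:3e-21`"
    "RecName: Full=pH-response regulator protein palI/RIM9;`"
    "Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; "
    "Saccharomycetes; Saccharomycetales; Dipodascaceae; Yarrowia."
)
print(blast_evalue(hit))
print(passes_evalue(hit, 1e-20))
```

A record with no BLAST hit is treated as failing the threshold here; whether that is the right choice depends on how you intend to use the filtered report.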
The output has the following column headers:
0 #component
1 trans_derived
2 prot_id
3 TopBlastHit
4 Pfam
5 SignalP
6 TmHMM
7 eggnog
8 gene_ontology
9 prot_seq
and the data are formatted like so:
# example 1
0 comp2700_c0
1 comp2700_c0_seq1:2-4099(+)
2 m.1814
3 sp|Q8E5V5|GLGA_STRA3`Q8E5V5`Q:270-720,H:15-474`25.26%ID`E:2e-17`RecName: Full=Glycogen synthase; EC=2.4.1.21; AltName: Full=Starch [bacterial glycogen] synthase;`Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus.
4 PF08323.6^Glyco_transf_5^Starch synthase catalytic domain^253-458^E:9.5e-18`PF00534.15^Glycos_transf_1^Glycosyl transferases group 1^537-659^E:1.7e-09
5 sigP:1`18`0.816`YES
6 ExpAA=279.46`PredHel=13`Topology=o172-194i932-954o969-991i996-1018o1028-1050i1062-1084o1104-1126i1152-1174o1189-1208i1221-1243o1258-1280i1287-1309o1335-1357i
7 COG0297^ Glycogen synthase
8 GO:0009011^molecular_function^starch synthase activity`GO:0005978^biological_process^glycogen biosynthetic process
9 KKSFLFQFLFGWIVLSSAQWLSVLDEAENLNSSFSLESVDSFAPVRPRFIIDEDFAEDYNLTVDILHRPLQENFDSFFPNVEAYVESGNSNGDLMSDNGKLDDLNSRAAYSALKALQNSYGSSHLYRFTPYELFGQSIWIEEAPEVNHVGWSIMFDNLGRYFLLELRGLREVTFALFITFSIVPIITGILTVYIYKKKYCVIKFNKSGRSKKKDSWLKRSKDELLRTDSANLLTLNDNDEPVMIRHSCKRTCILFATLEYNIPEWNIKIKIGGLGVMAELMSKTLKQYDLVWVVPCVGDITYPVAETAPSLVVKVVNQDYEVKVFYHYKDNIKYVLLDSPIFRKRTSHDPYPPRMDDISSAIFYSVWNQSIAAIIRREKVDIYHINDYHGALAPIYNLPEVIPCAISLHNAEFQGLWPLQSSIDEREVCGLFNVSKTICREYIQFGNAFNLMHCGVSYVRRHQSGYGVVGVSNKYGQRSWVRYPVFWSLKKIGQLPNPDPTDIGLSVNPVNQQLPDFAEYASVRKENKRKAQEWAGLTIDDEADLLVFVGRWSVQKGIDILADLAPTLLEKFNIQLIVVGPLIDLYGKFAAEKFMYIMERYPGRVFSKPEFVHLPPFIFEGADFALIPSRDEPFGLVAVEFGRRGAICIGSRVGGLGEMPGWWYSVESSSTAYLLKQLEKSCTLALKSTPEMRHKLRIAALQQRFPVDEWVALYDRLIRNCIKAHNKQQQRRSIKSFFSCITPNKPTKDVNDILSEKSAFSPADYEHSIDIREHTSYDANSMDNDSDEDNYEQAESIISSLSSSALSELSYISESSMNIGSRLDERFIDANGVAIRDFSAELTYLTPENSKGKLSIDHFLNKVQSRWHDEEHHYYKTGFRKRVYKYLKIKDKKSKDVDPDDDLVNQLPLNAYTKPRYKSAASTRLNIYQRILYLKVFTWPLYTIFLSLGQILSISSFQLSLLSGFEDNNQISLYVITGVFILSTIVWWGLYRNLPSVHSLSLPFAVYALAFLFTGISSMSLPYHIRGWLSYAATWVYAIAAASGPLYFTLNFEDEHCSGLGSSITRACVLQGVQQLWLSFLWLWGTLSSRLDYNYKVLLQPINSVYVVAGVWPVSFVLLSVCILLYKGLPPFYRQKPGSIPAFSKSLLHRKVVICFLISVINQNFWMSTLISQAWRFFWGSKLTKLWKIVVMTVSFLVGAWLIIFYVLRKLSNKHTWMVPVLGLGFGAIKWMHVFWGTSNVGIFLPWAGIAGPYLSRALWLWLGILDSIQGIGNGLILLQTLSRRHVTNTLMISQLAGSATSILARFVSPTKTGPANVFPDLTGYTPVDRAKPVANAPFWICLILNVALCIMYLRCYHRENISRP*
# example 2
0 comp1507_c0
1 comp1507_c0_seq1:405-1415(+)
2 m.772
3 sp|Q7Z8R5|PALI_YARLI`Q7Z8R5`Q:1-236,H:3-234`30.13%ID`E:3e-21`RecName: Full=pH-response regulator protein palI/RIM9;`Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Dipodascaceae; Yarrowia.
4 PF06687.7^SUR7^SUR7/PalI family^7-165^E:2.4e-18
5 sigP:1`23`0.564`YES
6 ExpAA=91.29`PredHel=4`Topology=i9-31o89-111i118-140o150-172i
7 NOG12793^ Calcium ion binding protein
8 GO:0016021^cellular_component^integral to membrane`GO:0005886^cellular_component^plasma membrane
9 MRIRSATPSLILLVIAIVFFVLAICTPPLANNLTLGKYGDVRFGVFGYCLNSNCSKPLVGYNSDYLDEHAKDGFRTSVIVRQRASYGLVIVPVSACICLISTIMTIFAHIGAIARSPGFFNVIGTITFFNIFITAIAFVICVITFVPHIQWPSWLVLANVGIQLIVLLLLLVARRQATRLQAKHLRRATSGSLGYNPYSLQNSSNIFSTSSRKGDLPKFSDYSAEKPMYDTISEDDGLKRGGSVSKLKPTFSNDSRSLSSYAPTVREPVPVPKSNSGFRFPFMRNKPAEQAPENPFRDPENPFKDPASAPAPNPWSINDVQANNDKKPSRFSWGRS*
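The numbered layout above can be mapped onto named columns when reading the report programmatically. The Python sketch below assumes that the report file is tab-delimited text with one record per line and the ten columns in the order listed above, and that header lines begin with '#'; these are assumptions about the file layout, and the function names are hypothetical.

```python
# Column names as listed in this section, in report order.
COLUMNS = [
    "#component", "trans_derived", "prot_id", "TopBlastHit", "Pfam",
    "SignalP", "TmHMM", "eggnog", "gene_ontology", "prot_seq",
]


def parse_report_line(line):
    """Map one tab-delimited report line onto the column names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


def read_report(lines):
    """Yield one dict per record, skipping blank lines and '#' header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield parse_report_line(line)


# Example 2 from above, joined with tabs; prot_seq is truncated for brevity.
row = "\t".join([
    "comp1507_c0",
    "comp1507_c0_seq1:405-1415(+)",
    "m.772",
    "sp|Q7Z8R5|PALI_YARLI`Q7Z8R5`Q:1-236,H:3-234`30.13%ID`E:3e-21`"
    "RecName: Full=pH-response regulator protein palI/RIM9;`"
    "Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; "
    "Saccharomycetes; Saccharomycetales; Dipodascaceae; Yarrowia.",
    "PF06687.7^SUR7^SUR7/PalI family^7-165^E:2.4e-18",
    "sigP:1`23`0.564`YES",
    "ExpAA=91.29`PredHel=4`Topology=i9-31o89-111i118-140o150-172i",
    "NOG12793^ Calcium ion binding protein",
    "GO:0016021^cellular_component^integral to membrane`"
    "GO:0005886^cellular_component^plasma membrane",
    "MRIRSATPSLILLVIAIVFF",
])
record = parse_report_line(row)
print(record["prot_id"], record["SignalP"])
```

To read a real report, pass an open file handle to read_report, for example `with open("trinotate_annotation_report.xls") as fh: records = list(read_report(fh))`.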
Backticks (`) and carets (^) are used as delimiters for data packed within an individual field, such as separating E-values, percent identity, and taxonomic information for best matches. When a field holds multiple assignments, the assignments are separated by backticks (`) and the fields within an assignment are separated by carets (^). In a future release (post Feb-2013), the backticks and carets will be used more uniformly than above, for example with carets as BLAST field separators and with more than the top hit included.
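The delimiter scheme can be unpacked with two nested splits. The Python sketch below is a minimal illustration, with hypothetical function names. The meaning given to the five Pfam sub-fields (accession, name, description, coordinate range, E-value) and the key=value reading of the TmHMM field are inferred from the example records in this document, not taken from a format specification.

```python
def split_assignments(field):
    """Split a packed field on backticks, then each assignment on carets."""
    if not field:
        return []
    return [assignment.split("^") for assignment in field.split("`")]


def parse_pfam(field):
    """Read each Pfam assignment as accession, name, description, range, E-value."""
    domains = []
    for accession, name, description, span, evalue in split_assignments(field):
        start, end = span.split("-")
        domains.append({
            "accession": accession,
            "name": name,
            "description": description,
            "start": int(start),
            "end": int(end),
            "evalue": float(evalue.split(":", 1)[1]),  # drop the "E:" prefix
        })
    return domains


def parse_tmhmm(field):
    """Read the TmHMM field as key=value pairs separated by backticks."""
    return dict(pair.split("=", 1) for pair in field.split("`"))


# Field values taken from the example records in this document.
pfam = (
    "PF08323.6^Glyco_transf_5^Starch synthase catalytic domain^253-458^E:9.5e-18`"
    "PF00534.15^Glycos_transf_1^Glycosyl transferases group 1^537-659^E:1.7e-09"
)
go = (
    "GO:0016021^cellular_component^integral to membrane`"
    "GO:0005886^cellular_component^plasma membrane"
)
tmhmm = "ExpAA=91.29`PredHel=4`Topology=i9-31o89-111i118-140o150-172i"

for domain in parse_pfam(pfam):
    print(domain["accession"], domain["start"], domain["end"], domain["evalue"])
print([terms[0] for terms in split_assignments(go)])
print(parse_tmhmm(tmhmm)["PredHel"])
```

The TopBlastHit field is deliberately not passed through split_assignments: in the records shown here it uses backticks between the sub-fields of a single hit, so a generic split would treat each sub-field as a separate assignment.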
Trinotate: Sample data and execution
Sample data and a runMe.sh script are available at $TRINITY_HOME/Analysis/FunctionalAnnotation/sample_data
Executing the runMe.sh script will pull down the Trinotate SQLite database, populate it with the provided precomputed results, and generate the final Trinotate annotation report.
Literature references for software used for functional annotation
- [Trinity] Full-length transcriptome assembly from RNA-Seq data without a reference genome. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Nature Biotechnology 29, 644-652 (2011).
- [HMMER] HMMER web server: interactive sequence similarity searching. R.D. Finn, J. Clements, S.R. Eddy. Nucleic Acids Research (2011) Web Server Issue 39:W29-W37.
- [PFAM] The Pfam protein families database. M. Punta, P.C. Coggill, R.Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E.L.L. Sonnhammer, S.R. Eddy, A. Bateman, R.D. Finn. Nucleic Acids Research (2012) Database Issue 40:D290-D301.
- [SignalP] SignalP 4.0: discriminating signal peptides from transmembrane regions. Thomas Nordahl Petersen, Soren Brunak, Gunnar von Heijne & Henrik Nielsen. Nature Methods, 8:785-786, 2011.
- [tmHMM] Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. J Mol Biol. 2001 Jan 19;305(3):567-80.
- [BLAST] Basic local alignment search tool. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. J Mol Biol 215: 403-10 (1990).
- [KEGG] KEGG for integration and interpretation of large-scale molecular datasets. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. Nucleic Acids Res. 40, D109-D114 (2012).
- [GO] Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25: 25-29 (2000).
- [eggNOG] eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P. Nucleic Acids Res. 2012 Jan;40(Database issue):D284-9. Epub 2011 Nov 16.