Trinotate

Background: Trinotate is a comprehensive annotation suite designed for automatic functional annotation of de novo transcriptome assemblies, from model or novel organisms, created using the Trinity assembly program. Trinotate uses a number of well-referenced methods for functional annotation, including homology search against known sequence data (NCBI BLAST), protein domain identification (HMMER/Pfam), signal peptide and transmembrane region prediction (signalP/tmHMM), and comparison to curated annotation databases (EMBL UniProt, eggNOG, GO Pathways). All functional annotation data derived from the analysis of a Trinity transcriptome assembly are integrated into an SQLite database, which allows fast, efficient searching for terms with specific qualities related to a desired scientific hypothesis, and also provides a means to create a whole annotation report for a transcriptome.

Trinotate is included in the Trinity package under $TRINITY_HOME/Analysis/FunctionalAnnotation/.

Software and Data Required

1. Software Required

Trinity (Trinotate is bundled with the distribution): http://trinityrnaseq.sourceforge.net/

sqlite (required for database integration): http://www.sqlite.org/

NCBI BLAST+ (BLAST database homology search): http://www.ncbi.nlm.nih.gov/books/NBK52640/

HMMER/Pfam (protein domain identification): http://hmmer.janelia.org/software

signalP v4 (free academic download): http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp

After installing signalP, edit the following line in the 'signalp' script so that it reads as shown, increasing the maximum number of entries that can be processed:
my $MAX_ALLOWED_ENTRIES=2000000;  # default is only 2000
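
The edit above can also be scripted. The following is a minimal sketch, assuming the line appears in the file exactly as `my $MAX_ALLOWED_ENTRIES=<number>;` with no extra spaces; the function name `raise_signalp_limit` is hypothetical and not part of signalP or Trinotate. A backup copy with a `.bak` suffix is kept.

```shell
# Hypothetical helper (not part of signalP): raise the entry limit in place.
# Usage: raise_signalp_limit /path/to/signalp
raise_signalp_limit() {
    script="$1"
    # -i.bak edits in place and keeps the original as <script>.bak.
    # Assumes the line has the exact form: my $MAX_ALLOWED_ENTRIES=<digits>;
    sed -i.bak 's/^my \$MAX_ALLOWED_ENTRIES=[0-9]*;/my $MAX_ALLOWED_ENTRIES=2000000;/' "$script" &&
        # Fail (non-zero status) if the expected line is not present afterwards.
        grep -q '^my \$MAX_ALLOWED_ENTRIES=2000000;' "$script"
}
```

If the function returns a non-zero status, the line was probably written differently in your signalP version and should be edited by hand.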

tmhmm v2 (free academic download): http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm

You might need to edit the header lines of the scripts 'tmhmm' and 'tmhmmformat.pl' to read:
#!/usr/bin/env perl

2. Sequence Databases Required

Be sure the BLAST search database is properly formatted by running the following (requires that BLAST+ is already installed, as indicated above):
makeblastdb -in uniprot_trembl.fasta -dbtype prot
If you don't have the Pfam database installed, download it, uncompress it, and prepare it for use with 'hmmscan' like so:
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm

Running Sequence Analyses

1. Trinity files needed for execution

Trinity.fasta: The final product, containing all the transcripts assembled by Trinity.

Trinity.fasta.transdecoder.pep: The most likely longest-ORF peptide candidates generated from the Trinity assembly. Instructions for generating this file can be found here: http://trinityrnaseq.sourceforge.net/analysis/extract_proteins_from_trinity_transcripts.html

Note
Transdecoder is included in Trinity at $TRINITY_HOME/trinity-plugins/transdecoder/. Newer versions of Transdecoder generate a file named Trinity.fasta.transdecoder.pep; earlier versions generate a file named best_candidates.eclipsed_orfs_removed.pep. The two should be treated as equivalent outputs.

2. Capturing BLAST Homologies

Instructions for installing command-line standalone BLAST can be found here: http://www.ncbi.nlm.nih.gov/books/NBK52640/

NOTE: This step will likely take the longest. For very large files, running on a multi-CPU server or HPC environment is highly recommended, and the thread count should equal the number of CPUs present on the node on which the job is run.

Blast Commands

blastp -query best_candidates.eclipsed_orfs_removed.pep -db SwissProtFormated -num_threads 8 -max_target_seqs 1 -outfmt 6 > TrinotateBlast.out

Note
-num_threads should equal the number of cores on your computer or computational node.
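
As a rough sketch of how the thread count could be set automatically rather than hard-coded, the snippet below detects the number of online CPUs and prints the BLAST command from above with that value filled in. The function name `detect_threads` and the variable `THREADS` are hypothetical; `nproc` and `getconf _NPROCESSORS_ONLN` are assumed to be available on most Linux and macOS systems, with a fallback of 1.

```shell
# Hypothetical helper: report the number of online CPUs, falling back to 1.
detect_threads() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}

THREADS=$(detect_threads)

# Print the command rather than running it, so it can be inspected first.
echo "blastp -query best_candidates.eclipsed_orfs_removed.pep -db SwissProtFormated -num_threads $THREADS -max_target_seqs 1 -outfmt 6 > TrinotateBlast.out"
```

On a shared HPC node, the scheduler's allocated core count may be lower than what this reports, in which case the allocation should take precedence.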

3. Running HMMER to identify protein domains

hmmscan (HMMER) command:

hmmscan --cpu 8 --domtblout TrinotatePFAM.out PfamA26.hmm best_candidates.eclipsed_orfs_removed.pep > pfam.log

Note
--cpu should equal the number of cores on your computer or computational node.

4. Running signalP to predict signal peptides

signalP command:

signalp -f short -n signalp.out Trinity.fasta.transdecoder.pep

5. Running tmHMM to predict transmembrane regions

tmhmm command:

tmhmm --short < best_candidates.pep > tmhmm.out

Trinotate: Loading Above Results into a Trinotate SQLite Database

The following commands import the results of the analyses performed above into a Trinotate SQLite database. This database is used to generate the final summary output file, and we expect it to support additional downstream applications that are under development.

Note: the Trinotate.pl script can be found in the Trinity software package at the path below, followed by its usage message:

$TRINITY_HOME/Analysis/FunctionalAnnotation/Trinotate.pl
usage: /Users/bhaas/SVN/trinityrnaseq/trunk/Analysis/FunctionalAnnotation/Trinotate.pl <command> <input> [...]
<command>:  LOAD_transdecoder
            LOAD_blast
            LOAD_pfam
            LOAD_tmhmm
            LOAD_signalp
            report

1. Retrieve the Trinotate Pre-generated Resource SQLite database

A pregenerated SQLite database containing SwissProt-related annotation information is available from the Trinity download site. Pull it down like so (the -O option ensures the file is saved under its proper name rather than as 'download'):

wget -O TrinityFunctional.swissprot.2012-02-13.db.gz "http://sourceforge.net/projects/trinityrnaseq/files/misc/TRINOTATE_RESOURCES/TrinityFunctional.swissprot.2012-02-13.db.gz/download"

Uncompress it and rename it to 'TrinityFunctional.db':

gunzip TrinityFunctional.swissprot.2012-02-13.db.gz
mv TrinityFunctional.swissprot.2012-02-13.db TrinityFunctional.db

2. Load the Transdecoder Trinity Peptide Predictions

Trinotate.pl LOAD_transdecoder Trinity.fasta.transdecoder.pep

3. Load BLAST homologies

Trinotate.pl LOAD_blast TrinotateBlast.out

4. Load Pfam domain entries

Trinotate.pl LOAD_pfam TrinotatePFAM.out

5. Load transmembrane domains

Trinotate.pl LOAD_tmhmm tmhmm.out

6. Load signal peptide predictions

Trinotate.pl LOAD_signalp signalp.out
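
The five load steps above can be chained in a small wrapper that stops at the first missing input or failed command. This is only a sketch: the function `load_all` and the `TRINOTATE` variable are hypothetical names, the input file names are those used in this guide, and Trinotate.pl is assumed to be on your PATH. Setting `TRINOTATE=echo` gives a dry run that prints each step instead of executing it.

```shell
# Hypothetical wrapper: run the LOAD_* steps in the order given above.
# Override TRINOTATE (e.g. TRINOTATE=echo) for a dry run.
TRINOTATE="${TRINOTATE:-Trinotate.pl}"

load_all() {
    # Each entry is "<command>:<input file>".
    for step in \
        "LOAD_transdecoder:Trinity.fasta.transdecoder.pep" \
        "LOAD_blast:TrinotateBlast.out" \
        "LOAD_pfam:TrinotatePFAM.out" \
        "LOAD_tmhmm:tmhmm.out" \
        "LOAD_signalp:signalp.out"
    do
        cmd=${step%%:*}    # text before the first colon
        file=${step#*:}    # text after the first colon
        # Refuse to continue if an input is missing or empty.
        if [ ! -s "$file" ]; then
            echo "missing or empty input for $cmd: $file" >&2
            return 1
        fi
        $TRINOTATE "$cmd" "$file" || return 1
    done
}
```

Run `load_all` from the directory that holds the result files and the TrinityFunctional.db database.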

Trinotate: Output an Annotation Report

Trinotate.pl report [opts] > trinotate_annotation_report.xls

Note: you can threshold the BLAST and Pfam results that are reported by including the options below:

##################################################################
#
#  -E <float>                 maximum E-value for reporting best blast hit
#                             and associated annotations.
#
#  --pfam_cutoff <string>     'DNC' : domain noise cutoff (default)
#                             'DGC' : domain gathering cutoff
#                             'DTC' : domain trusted cutoff
#                             'SNC' : sequence noise cutoff
#                             'SGC' : sequence gathering cutoff
#                             'STC' : sequence trusted cutoff
#
##################################################################
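
To illustrate how these options combine, the sketch below builds a report command from an E-value and a Pfam cutoff code, rejecting any code not in the list above. The function name `build_report_cmd` is hypothetical, and the E-value of 1e-5 is purely illustrative, not a recommended setting.

```shell
# Hypothetical helper: validate the cutoff code, then print the report command.
build_report_cmd() {
    evalue="$1"
    cutoff="$2"
    case "$cutoff" in
        DNC|DGC|DTC|SNC|SGC|STC) ;;   # the six codes listed in the options above
        *) echo "unknown --pfam_cutoff value: $cutoff" >&2; return 1 ;;
    esac
    echo "Trinotate.pl report -E $evalue --pfam_cutoff $cutoff > trinotate_annotation_report.xls"
}

# Illustrative values only.
build_report_cmd 1e-5 DGC
```

The printed line can be reviewed and then pasted into the shell to produce the report.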

The output has the following column headers:

0   #component
1   trans_derived
2   prot_id
3   TopBlastHit
4   Pfam
5   SignalP
6   TmHMM
7   eggnog
8   gene_ontology
9   prot_seq

and the data are formatted like so:

# example 1

0   comp2700_c0
1   comp2700_c0_seq1:2-4099(+)
2   m.1814
3   sp|Q8E5V5|GLGA_STRA3`Q8E5V5`Q:270-720,H:15-474`25.26%ID`E:2e-17`RecName: Full=Glycogen synthase; EC=2.4.1.21; AltName: Full=Starch [bacterial glycogen] synthase;`Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus.
4   PF08323.6^Glyco_transf_5^Starch synthase catalytic domain^253-458^E:9.5e-18`PF00534.15^Glycos_transf_1^Glycosyl transferases group 1^537-659^E:1.7e-09
5   sigP:1`18`0.816`YES
6   ExpAA=279.46`PredHel=13`Topology=o172-194i932-954o969-991i996-1018o1028-1050i1062-1084o1104-1126i1152-1174o1189-1208i1221-1243o1258-1280i1287-1309o1335-1357i
7   COG0297^ Glycogen synthase
8   GO:0009011^molecular_function^starch synthase activity`GO:0005978^biological_process^glycogen biosynthetic process
9   KKSFLFQFLFGWIVLSSAQWLSVLDEAENLNSSFSLESVDSFAPVRPRFIIDEDFAEDYNLTVDILHRPLQENFDSFFPNVEAYVESGNSNGDLMSDNGKLDDLNSRAAYSALKALQNSYGSSHLYRFTPYELFGQSIWIEEAPEVNHVGWSIMFDNLGRYFLLELRGLREVTFALFITFSIVPIITGILTVYIYKKKYCVIKFNKSGRSKKKDSWLKRSKDELLRTDSANLLTLNDNDEPVMIRHSCKRTCILFATLEYNIPEWNIKIKIGGLGVMAELMSKTLKQYDLVWVVPCVGDITYPVAETAPSLVVKVVNQDYEVKVFYHYKDNIKYVLLDSPIFRKRTSHDPYPPRMDDISSAIFYSVWNQSIAAIIRREKVDIYHINDYHGALAPIYNLPEVIPCAISLHNAEFQGLWPLQSSIDEREVCGLFNVSKTICREYIQFGNAFNLMHCGVSYVRRHQSGYGVVGVSNKYGQRSWVRYPVFWSLKKIGQLPNPDPTDIGLSVNPVNQQLPDFAEYASVRKENKRKAQEWAGLTIDDEADLLVFVGRWSVQKGIDILADLAPTLLEKFNIQLIVVGPLIDLYGKFAAEKFMYIMERYPGRVFSKPEFVHLPPFIFEGADFALIPSRDEPFGLVAVEFGRRGAICIGSRVGGLGEMPGWWYSVESSSTAYLLKQLEKSCTLALKSTPEMRHKLRIAALQQRFPVDEWVALYDRLIRNCIKAHNKQQQRRSIKSFFSCITPNKPTKDVNDILSEKSAFSPADYEHSIDIREHTSYDANSMDNDSDEDNYEQAESIISSLSSSALSELSYISESSMNIGSRLDERFIDANGVAIRDFSAELTYLTPENSKGKLSIDHFLNKVQSRWHDEEHHYYKTGFRKRVYKYLKIKDKKSKDVDPDDDLVNQLPLNAYTKPRYKSAASTRLNIYQRILYLKVFTWPLYTIFLSLGQILSISSFQLSLLSGFEDNNQISLYVITGVFILSTIVWWGLYRNLPSVHSLSLPFAVYALAFLFTGISSMSLPYHIRGWLSYAATWVYAIAAASGPLYFTLNFEDEHCSGLGSSITRACVLQGVQQLWLSFLWLWGTLSSRLDYNYKVLLQPINSVYVVAGVWPVSFVLLSVCILLYKGLPPFYRQKPGSIPAFSKSLLHRKVVICFLISVINQNFWMSTLISQAWRFFWGSKLTKLWKIVVMTVSFLVGAWLIIFYVLRKLSNKHTWMVPVLGLGFGAIKWMHVFWGTSNVGIFLPWAGIAGPYLSRALWLWLGILDSIQGIGNGLILLQTLSRRHVTNTLMISQLAGSATSILARFVSPTKTGPANVFPDLTGYTPVDRAKPVANAPFWICLILNVALCIMYLRCYHRENISRP*

# example 2

0   comp1507_c0
1   comp1507_c0_seq1:405-1415(+)
2   m.772
3   sp|Q7Z8R5|PALI_YARLI`Q7Z8R5`Q:1-236,H:3-234`30.13%ID`E:3e-21`RecName: Full=pH-response regulator protein palI/RIM9;`Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Dipodascaceae; Yarrowia.
4   PF06687.7^SUR7^SUR7/PalI family^7-165^E:2.4e-18
5   sigP:1`23`0.564`YES
6   ExpAA=91.29`PredHel=4`Topology=i9-31o89-111i118-140o150-172i
7   NOG12793^ Calcium ion binding protein
8   GO:0016021^cellular_component^integral to membrane`GO:0005886^cellular_component^plasma membrane
9   MRIRSATPSLILLVIAIVFFVLAICTPPLANNLTLGKYGDVRFGVFGYCLNSNCSKPLVGYNSDYLDEHAKDGFRTSVIVRQRASYGLVIVPVSACICLISTIMTIFAHIGAIARSPGFFNVIGTITFFNIFITAIAFVICVITFVPHIQWPSWLVLANVGIQLIVLLLLLVARRQATRLQAKHLRRATSGSLGYNPYSLQNSSNIFSTSSRKGDLPKFSDYSAEKPMYDTISEDDGLKRGGSVSKLKPTFSNDSRSLSSYAPTVREPVPVPKSNSGFRFPFMRNKPAEQAPENPFRDPENPFKDPASAPAPNPWSINDVQANNDKKPSRFSWGRS*

Backticks (`) and carets (^) are used as delimiters for data packed within an individual field, such as separating E-values, percent identity, and taxonomic information for best matches. When a field holds multiple assignments, as in the Pfam and gene_ontology fields above, the assignments are separated by backticks (`) and the values within an assignment are separated by carets (^). The TopBlastHit field is currently an exception: it uses backticks to separate the values of a single hit. In a future release (post Feb-2013), the backticks and carets will be used more uniformly, with carets as BLAST field separators, and more than the top hit will be included.
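
The packed fields can be unpacked with standard text tools. The following is a small sketch using values copied from the examples above; the function names `blast_evalue` and `pfam_accessions` are hypothetical, and the position of the E-value (fifth backtick-separated value of TopBlastHit) is taken from the current format shown here, which may change in later releases.

```shell
# TopBlastHit value copied from example 2 above.
hit='sp|Q7Z8R5|PALI_YARLI`Q7Z8R5`Q:1-236,H:3-234`30.13%ID`E:3e-21`RecName: Full=pH-response regulator protein palI/RIM9;`Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Dipodascaceae; Yarrowia.'

# Pfam value copied from example 1 above.
pfam='PF08323.6^Glyco_transf_5^Starch synthase catalytic domain^253-458^E:9.5e-18`PF00534.15^Glycos_transf_1^Glycosyl transferases group 1^537-659^E:1.7e-09'

# Hypothetical helper: print the E-value of a TopBlastHit value.
blast_evalue() {
    # Split on backticks; the fifth value looks like "E:3e-21".
    printf '%s\n' "$1" | awk -F'`' '{ sub(/^E:/, "", $5); print $5 }'
}

# Hypothetical helper: print one Pfam accession per line.
pfam_accessions() {
    # Backticks separate assignments; carets separate values within each.
    printf '%s\n' "$1" | tr '`' '\n' | cut -d'^' -f1
}

blast_evalue "$hit"       # prints 3e-21
pfam_accessions "$pfam"   # prints PF08323.6 then PF00534.15
```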

Trinotate: Sample data and execution

Sample data and a runMe.sh script are available at $TRINITY_HOME/Analysis/FunctionalAnnotation/sample_data

Executing the runMe.sh script will pull down the Trinotate SQLite database, populate it with the provided analysis results, and generate the final Trinotate annotation report.

Literature references for software used for functional annotation