This is an old revision of the document!

Gene prediction curation with IGV

Joran Martijn (February 2023)

After applying one or more gene prediction algorithms on your genome, you would do well to manually curate the predicted gene models manually. Even the most advanced algorithms to date can still make mistakes that are otherwise obvious to the human eye.

The curation process described here aims to compile all the necessary information in a single screen in the graphical user interface of the Integrated Genomics Viewer (IGV), which allows you to edit genes one by one if deemed necessary.

Example

An example view:

This example integrates the following source of information (top-to-bottom):

Location in the genome
RNAseq coverage levels
RNAseq intron junctions
The RNAseq reads mapped to the Ergobibamus genome sequence
The Ergobibamus genome sequence (visible when zooming in)
Repeats predicted by RepeatMasker
The gene models predicted by Funannotate, Braker2 and various GeneMark algorithms (not visible in the screenshot)

Strand-specific RNAseq reads

Note that here two distinct RNAseq tracks are loaded. One for all the reads stemming from transcripts transcribed from the positive strand of the genome, and one for all those transcribed from the negative strand of the genome. This kind of strand-specificity information is only available when a RNAseq library protocol was used to retains this information, for example the dUTP method. I believe nowadays this is the default so it should be available to you.

If you used HISAT2 to map your RNAseq reads to your genome, all “positive” reads will have the tag XS:A:+, while all “negative” reads will have the XS:A:- tag. One can separate the two types of reads from the HISAT2-generated BAM file using the following command:

samtools view -b --tag XS:+ <rnaseq.bam> > <rnaseq.pos.bam>
samtools index <rnaseq.pos.bam> <rnaseq.pos.bam.bai>

samtools view -b --tag XS:- <rnaseq.bam> > <rnaseq.neg.bam>
samtools index <rnaseq.neg.bam> <rnaseq.neg.bam.bai>

Loading into IGV

To load everything into a fresh IGV window, one can load each track in the old-fashioned way. First load your genome with Genomes→Load Genome from File…, and then load each individual track with File→Load from File…. This is fine doing once, but if your IGV crashes for whatever reason, or becomes a tangled web of tracks or you accidentally removed your Sequence track (these things happen more frequently than you want) you will need to restart IGV and load everything from scratch, collapse/expand and color tracks as you desire, etc, etc again. This is a huge pain in the ass!

The second way would be to load everything manually as before, but then use File→Save Session… and reload sessions after crashing and restarting IGV.

The third, and my preferred way, is to load everything at once using a JSON file. See an example file below:

{
    "id": "ergo",
    "name": "Ergobibamus cyprinoides CL",
    "fastaURL": "ergo_cyp_genome.fasta",
    "indexURL": "ergo_cyp_genome.fasta.fai",
    "chromosomeOrder": [
        "tig00000012",
        "tig00000492",
        "tig00000016",
        "tig00000015",
        "tig00000496",
        "tig00000019",
        "tig00000498",
        "tig00000497",
        "tig00000079",
        "tig00000031",
        "tig00000018",
        "tig00000013",
        "tig00000513",
        "tig00000510",
        "tig00000506",
        "tig00000491",
        "tig00000445",
        "tig00000507",
        "tig00000505",
        "tig00000509",
        "tig00000514",
        "tig00000511",
        "tig00000092",
        "tig00000512",
        "tig00000494",
        "tig00000493",
        "tig00000495",
        "tig00000508"
    ],
    "tracks": [
        {
            "name": "Repeats",
            "type": "annotation",
            "format": "gff3",
            "url": "repeats.gff3"
        },
        {
            "name": "Curated GFF3",
            "type": "annotation",
            "format": "gff3",
            "url": "01_curated_purged.gff3",
            "color": "#2986cc",
            "displayMode": "EXPANDED"
        },
        {
            "name": "Funannotate - Patched",
            "type": "annotation",
            "format": "gff3",
            "url": "funannotate_predict_patched.gff3",
            "color": "#ff8200"
        },
        {
            "name": "Funnotate - Update",
            "type": "annotation",
            "format": "gff3",
            "url": "funnotate.gff3",
            "color": "#009966",
            "displayMode": "EXPANDED"
        },
        {
            "name": "Braker2 - Patched",
            "type": "annotation",
            "format": "gff3",
            "url": "curated_purged.contigs_012_016_492.gff3",
            "color": "#ed48df"
        },
        {
            "name": "Braker2 - Simplified",
            "type": "annotation",
            "format": "gff3",
            "url": "EVM.raw.simple.gff3",
            "color": "#f1c232",
            "displayMode": "COLLAPSED"
        },
        {
            "name": "GeneMark-ES",
            "type": "annotation",
            "format": "gff3",
            "url": "genemark_ES.gtf",
            "color": "#000000",
            "displayMode": "COLLAPSED"
        },
        {
            "name": "GeneMark-ET",
            "type": "annotation",
            "format": "gff3",
            "url": "genemark_ET.gtf",
            "color": "#bcbcbc",
            "displayMode": "COLLAPSED"
        },

        {
            "name": "RNAseq +",
            "type": "alignment",
            "format": "bam",
            "url": "rnaseq_vs_masked_ergo_cyp_genome.sort.pos_transcripts.bam",
            "indexURL": "rnaseq_vs_masked_ergo_cyp_genome.sort.pos_transcripts.bam.bai"
        },
        {
            "name": "RNAseq -",
            "type": "alignment",
            "format": "bam",
            "url": "rnaseq_vs_masked_ergo_cyp_genome.sort.neg_transcripts.bam",
            "indexURL": "rnaseq_vs_masked_ergo_cyp_genome.sort.neg_transcripts.bam.bai"
        }
    ]

}

This has some advantages:

Load everything in one go
Set up a desired display order of contigs / chromosomes
Set up “clean” names for each track
Set up the colors for each track
Easy to share as all settings are in a plain text file
You could automatically generate such a file in a pipeline if you want

Unfortunately I haven't been able to get it to collapse / expand tracks. In theory it should work with “displayMode”: “COLLAPSED” or “displayMode”: “EXPANDED” but it doesn't seem to work. In addition, you would still need to collapse the RNAseq intron junction track and the coloring of the RNAseq reads manually.

"name": Desired display name for your track
"type": Type of track. Could be "alignment", "annotation" and perhaps other types
"format" : "bam" or "gff3"
"url" : filename of the file you wish to load here
"fastaURL" : filename of the genome FASTA file
"indexURL" : name of the indexed BAM file (BAI) or FASTA file (FAI)
"color" : hexadecimal code of color you wish to use for the track

Adjust graphical settings

To color the RNAseq reads according to strandness, right-click on the track, then Color alignments by→first-of-pair strand

It may be useful to set a fixed y-scale of the coverage track. To do so, right-click on the coverage track, then Set Data Range, and enter your desired maximum value. I like 1000

Navigating IGV

I highly recommend using hotkeys to navigate IGV as much as possible. Once you have selected a track on the left (by clicking on it), you can:

+        Zoom in
-        Zoom out
Ctrl+F   Move to the next gene feature
Ctrl+B   Move to the previous gene feature
Arrows   Move a bit to the left or a bit to the right

For some frustratingly unknow reason, IGV only displays the three forward frames, OR the three reverse frames. It is, as far as I know, not possible to display all six frames simultaneously. To swap between reverse and forward frames, click on small arrow of the Sequence track. To see the sequence track, you need to be sufficiently zoomed in.

Curating gene models

IGV is excellent for viewing your data, but unfortunately not so much for editing your files in place. My solution has been to have the GFF3 file with your gene predictions open in a separate window, and adjusting the coordinates / adding or removing introns and exons directly in the text file. To make your changes visible in IGV, you need to remove the track and reload it. To make editing GFF3 files easier, I strongly recommend organizing your GFF3 files such that there is an empty line between each gene.

For example:

tig00000016 EVM gene    1991975 1995790 .   -   .   ID=ctg016.gene0736;Name=gene0736
tig00000016 EVM mRNA    1991975 1995790 .   -   .   ID=ctg016.mRNA0736;Parent=ctg016.gene0736;Name=mRNA0736
tig00000016 EVM CDS 1991975 1995790 .   -   0   ID=ctg016.mRNA0736.CDS01;Parent=ctg016.mRNA0736
tig00000016 EVM exon    1991975 1995790 .   -   .   ID=ctg016.mRNA0736.exon01;Parent=ctg016.mRNA0736

tig00000016 Manual  gene    1995957 1996271 .   -   .   ID=ctg016.gene0737-1;Name=gene0737-1
tig00000016 Manual  mRNA    1995957 1996271 .   -   .   ID=ctg016.mRNA0737-1;Parent=ctg016.gene0737-1;Name=mRNA0737-1
tig00000016 Manual  CDS 1995957 1996271 .   -   0   ID=ctg016.mRNA0737-1.CDS;Parent=ctg016.mRNA0737-1
tig00000016 Manual  exon    1995957 1996271 .   -   .   ID=ctg016.mRNA0737-1.exon01;Parent=ctg016.mRNA0737-1

tig00000016 Manual  gene    1996590 1997048 .   -   .   ID=ctg016.gene0737-3;Name=gene0737-3
tig00000016 Manual  mRNA    1996590 1997048 .   -   .   ID=ctg016.mRNA0737-3;Parent=ctg016.gene0737-3;Name=mRNA0737-3
tig00000016 Manual  CDS 1996590 1997048 .   -   0   ID=ctg016.mRNA0737-3.CDS;Parent=ctg016.mRNA0737-3
tig00000016 Manual  exon    1996590 1997048 .   -   .   ID=ctg016.mRNA0737-3.exon01;Parent=ctg016.mRNA0737-3

Curation example

Below we see an example of a potential gene that was missed by the gene prediction algorithm on the reverse strand. By checking the sequence strand and carefully moving from potential exon to exon, we can see that there is a potential gene here.

Using the ruler line of IGV (toolbar on the top, the last button on the right), we can determine the right coordinates, and we add this gene in our curated GFF3 file:

tig00000015 funannotate gene    617629  617991  .   -   .   ID=ctg015.gene001388-2f;Name=gene001388-2f;
tig00000015 funannotate mRNA    617629  617991  .   -   .   ID=ctg015.mRNA001388-2f;Parent=ctg015.gene001388-2f;Name=mRNA001388-2f
tig00000015 funannotate CDS 617629  617785  .   -   1   ID=ctg015.mRNA001388-2f.cds;Parent=ctg015.mRNA001388-2f
tig00000015 funannotate exon    617629  617785  .   -   .   ID=ctg015.mRNA001388-2f.exon03;Parent=ctg015.mRNA001388-2f
tig00000015 funannotate CDS 617835  617886  .   -   2   ID=ctg015.mRNA001388-2f.cds;Parent=ctg015.mRNA001388-2f
tig00000015 funannotate exon    617835  617886  .   -   .   ID=ctg015.mRNA001388-2f.exon02;Parent=ctg015.mRNA001388-2f
tig00000015 funannotate CDS 617937  617991  .   -   0   ID=ctg015.mRNA001388-2f.cds;Parent=ctg015.mRNA001388-2f
tig00000015 funannotate exon    617937  617991  .   -   .   ID=ctg015.mRNA001388-2f.exon01;Parent=ctg015.mRNA001388-2f

Note that I didn't explicitly defined intron features here, only gene, mRNA, CDS and exon features. The introns are implied as the coordinate space between exons. You can also explicitly define them if you want. IGV will work with either formatting choice.

Reload the Curated GFF3 track with the updated GFF3 file:

Also zoom in until you see the protein sequence to make sure that you haven't accidentally introduced any frameshifts (usually visible by premature STOP codons)

GFF3 Format

To edit GFF3 files, you must have a good understanding of the format. GFF3 is an extension of the older GTF format and an attempt to standardize the gene annotation format. The official rules and guidelines can be found on their GitHub page: GFF3 Format. This is quite a long and extensive page. For our purposes, we'll mainly need to know the following things:

GFF3 files are tab-delimited, and consist of 9 columns
Each line defines 1 feature. For example, gene, mRNA, exon, CDS, intron, etc

The 9 columns are as follows:

Column	Description
SeqID	Name of contig, chromosome, etc
Source	Source of the feature. E.g. `genemark`, `braker`, etc
Type	`gene`, `mRNA`, `exon`, `intron`, `CDS` etc
Start	Start coordinate
End	End coordinate
Score	Some kind of score. `.` if not defined
Strand	`+` or `-`
Phase	`0`, `1`, or `2`. Only relevant for `CDS` features
Attributes	`tag=value; tag=value;` See below

Phase is not the same as the reading frame. In short, 0 means that the first available codon starts on the 1st base of the feature 1 means that the first available codon starts on the 2nd base of the feature 2 means that the first available codon starts on the 3rd base of the feature

Thus, the CDS corresponding to the first exon always has phase 0. CDSs of subsequent exons can have phases 0, 1 or 2.

For example, if the second exon of a gene has sequence CGTTACAGAGC…, and the CDS has phase 1, the first codon of this sequence starts at the G, i.e. C-GTT-ACA-GAG-C…. However, if it had phase 2, it would be CG-TTA-CAG-AGC

cgeb2001's DokuWiki!

Table of Contents