Differences

This shows you the differences between two versions of the page.

--- gene_prediction_curation_with_igv [2023/02/02 11:35] – [Curation example] 134.190.232.140
+++ gene_prediction_curation_with_igv [2025/08/07 16:01] (current) – 134.190.145.228
@@ Line 3: / Line 3: @@
 Joran Martijn (February 2023)
-After applying one or more gene prediction algorithms on your genome, you would do well to manually curate the predicted gene models manually. Even the most advanced algorithms to date can still make mistakes that are otherwise obvious to the human eye.
+After applying one or more gene prediction algorithms on your genome, you would do well to curate the predicted gene models manually. Even the most advanced algorithms to date can still make mistakes that are otherwise obvious to the human eye.
 The curation process described here aims to compile all the necessary information in a single screen in the graphical user interface of the Integrated Genomics Viewer ([[https://software.broadinstitute.org/software/igv/|IGV]]), which allows you to edit genes one by one if deemed necessary.
@@ Line 26: / Line 26: @@
 Note that here two distinct RNAseq tracks are loaded. One for all the reads stemming from transcripts transcribed from the positive strand of the genome, and one for all those transcribed from the negative strand of the genome. This kind of strand-specificity information is only available when a RNAseq library protocol was used to retains this information, for example the dUTP method. I believe nowadays this is the default so it should be available to you.
-If you used HISAT2 to map your RNAseq reads to your genome, all "positive" reads will have the tag ''XS:A:+'', while all "negative" reads will have the ''XS:A:-'' tag. One can separate the two types of reads from the HISAT2-generated BAM file using the following command:
+If you used HISAT2 with ''--rna-strandness RF'' to map your RNAseq reads to your genome, all "positive" reads will have the tag ''XS:A:+'', while all "negative" reads will have the ''XS:A:-'' tag. One can separate the two types of reads from the HISAT2-generated BAM file using the following command:
 <code>
@@ Line 35: / Line 35: @@
 samtools index <rnaseq.neg.bam> <rnaseq.neg.bam.bai>
 </code>
+==== GC content ====
+It may be nice to have a track in IGV that shows you per-window-size %GC. Here I generate a so-called ''bedGraph'' formatted file that can be loaded into IGV
+<code>
+cat Ergobibamus_cyprinoides_CL.scaffolds.fa \
+    | seqkit sliding -W 50 -s 25 \
+    | seqkit fx2tab -n -g \
+    | sed -e "s/_sliding\:/\t/" -e "s/\-/\t/" -e "s/\s+/\t/g" \
+    > ergo_gc_content.bedgraph
+</code>
+You can play with the window size ''-W'' and overlap ''-s'' parameters to fine tune it to your own liking. You can either load in this track via ''File -> Load From File ..'' or through the JSON file (see below)
 ==== Loading into IGV ====
@@ Line 156: / Line 170: @@
             "url": "rnaseq_vs_masked_ergo_cyp_genome.sort.neg_transcripts.bam",
             "indexURL": "rnaseq_vs_masked_ergo_cyp_genome.sort.neg_transcripts.bam.bai"
+        },
+        {
+            "name": "GC content",
+            "type": "wig",
+            "format": "bedGraph",
+            "url": "ergo_gc_content.bedgraph",
+            "graphType": "bar",
+            "min": 20,
+            "max": 80,
+            "color": "#00dad7"
         }
     ]
 }
 </code>
+Load a JSON file with ''Genomes->Load Genome from File...''. If for whatever reason your file won't load, you can check your ''~/igv/igv0.log'' and ''~/igv/igv1.log'' log files. They may have some hints on why it doesn't want to load.
 This has some advantages:
@@ Line 175: / Line 202: @@
 "name": Desired display name for your track
 "type": Type of track. Could be "alignment", "annotation" and perhaps other types
-"format" : "bam" or "gff3"
+"format" : "bam" or "gff3" or "wig"
 "url" : filename of the file you wish to load here
 "fastaURL" : filename of the genome FASTA file
@@ Line 186: / Line 213: @@
 To color the RNAseq reads according to strandness, right-click on the track, then ''Color alignments by->first-of-pair strand''
-It may be useful to set a fixed y-scale of the coverage track. To do so, right-click on the coverage track, then ''Set Data Range'', and enter your desired maximum value. I like 1000
+It may be useful to set a fixed y-scale of the coverage track. To do so, right-click on the coverage track, then ''Set Data Range'', and enter your desired maximum value. I like 1000. To make the both tracks (positive and negative transcript coverage) comparable, set the same maximum value for both.
@@ Line 202: / Line 229: @@
 For some frustratingly unknow reason, IGV only displays the three forward frames, OR the three reverse frames. It is, as far as I know, not possible to display all six frames simultaneously. To swap between reverse and forward frames, click on small arrow of the //Sequence// track. To see the sequence track, you need to be sufficiently zoomed in.
 ==== Curating gene models ====
@@ Line 258: / Line 286: @@
   * GFF3 files are tab-delimited, and consist of 9 columns
   * Each line defines 1 feature. For example, gene, mRNA, exon, CDS, intron, etc
+The 9 columns are as follows:
 ^ Column      ^ Description                                                ^
@@ Line 270: / Line 300: @@
 | Attributes  | ''tag=value; tag=value;'' See below                        |
+**Phase** is not the same as the reading frame. In short,
+  * ''0'' means that the first available codon starts on the 1st base of the feature
+  * ''1'' means that the first available codon starts on the 2nd base of the feature
+  * ''2'' means that the first available codon starts on the 3rd base of the feature
+Thus, the CDS corresponding to the first exon //always// has phase 0. CDSs of subsequent exons can have phases 0, 1 or 2.
+For example, if the second exon of a gene has sequence ''CGTTACAGAGC...'', and the CDS has phase 1, the first codon of this sequence starts at the G, i.e. ''C-GTT-ACA-GAG-C...''. However, if it had phase 2, it would be ''CG-TTA-CAG-AGC''.
+The **Attributes** field contains all other information regarding the feature and has a number of ''tag=value;'' fields. Most common fields are the ''ID'', ''Name'' and ''Parent''.
+As the name implies, IDs are identifiers of your features. An ID of a feature **must be unique** within the entire GFF3 file. Of special consideration are //discontinuous// features. These are features that described by multiple lines in the GFF3 file and are in distinct regions on the chromosomes. In our case, ''CDS'' features are like this if there are multiple exons in a gene. Consider the following example:
+<code>
+tig00000015 funannotate gene    617629  617991  .   -   .   ID=ctg015.gene001388-2f;Name=gene001388-2f;
+tig00000015 funannotate mRNA    617629  617991  .   -   .   ID=ctg015.mRNA001388-2f;Parent=ctg015.gene001388-2f;Name=mRNA001388-2f
+tig00000015 funannotate CDS 617629  617785  .   -   1   ID=ctg015.mRNA001388-2f.cds;Parent=ctg015.mRNA001388-2f
+tig00000015 funannotate exon    617629  617785  .   -   .   ID=ctg015.mRNA001388-2f.exon03;Parent=ctg015.mRNA001388-2f
+tig00000015 funannotate CDS 617835  617886  .   -   2   ID=ctg015.mRNA001388-2f.cds;Parent=ctg015.mRNA001388-2f
+tig00000015 funannotate exon    617835  617886  .   -   .   ID=ctg015.mRNA001388-2f.exon02;Parent=ctg015.mRNA001388-2f
+tig00000015 funannotate CDS 617937  617991  .   -   0   ID=ctg015.mRNA001388-2f.cds;Parent=ctg015.mRNA001388-2f
+tig00000015 funannotate exon    617937  617991  .   -   .   ID=ctg015.mRNA001388-2f.exon01;Parent=ctg015.mRNA001388-2f
+</code>
+All CDS features here point to the same CDS. This makes sense, as a CDS is biologically speaking a single object. They thus all have the same ID (''ctg015.mRNA001388-2f.cds''), but they are in distinct regions (''617629-617785'', ''617835-617886'', ''617937-617991'') and are hence //discontinuous//.
+The **Name** attribute simply determines the display name of the feature in a genome viewer like IGV. Names do not need to be unique.
+**Parent** attributes describe the hierarchical relationship of features. For example, a ''gene'' feature is considered a parent of an ''mRNA'' feature, which in turn is a parent of ''CDS'' and ''exon'' features. Such hierarchical relationships tell us which mRNA features belong to which gene features, and so on.
+The '';'', ''='', ''&'' and '','', as well as TAB, newline, carriage return and percent signs have special meanings in the attribute fields. They should therefore be avoided as much as possible in the //value// part of the field. If they must be included, replace these characters with percent encoding signs.
+^ Special character ^ Percent encoding sign ^
+| '';''             | %3B |
+| ''=''             | %3D |
+| ''&''             | %26 |
+| '',''             | %2C |
+| TAB               | %09 |
+| newline           | %0A |
+| carriage return   | %0D |
+| ''%''             | %25 |
+==== Validating curated gene models ====
+We are only human.
+Despite our best efforts to curate errors made by the automated gene prediction tools, during the curation process, we humans are prone to make our own mistakes! To help detect at least some of these unintentional oversights, I wrote a script, ''validate_gene_models_in_gff3.py''.
+It will do some sanity checks on all gene models in the GFF3 file:
+  * Does the gene feature have any children features?
+  * Do gene coordinates match with mRNA & tRNA coordinates?
+  * Does the start coordinate of the first CDS/exon match with the gene start?
+  * Does the end coordinate of the last CDS/exon match with the gene end?
+  * Is the cumulative length of all exons a multiple of 3?
+  * Does the translate CDS start with an M and end with a * (STOP)?
+  * Are there any premature * (STOP) in the CDS?
+  * If Blastocystis genome, and there is no STOP, does it end with a TGTTTGTT motif?
+<code>
+validate_gene_models_in_gff3.py -g Ergobibamus_cyprinoides_CL.recurated.gff3 -f Ergobibamus_cyprinoides_CL.scaffolds.fa
+</code>
+If everything is OK, it will report ''Gene models in provided GFF3 passed all checks!''
+If a certain feature did not pass a check, the script will prematurely terminate and report an error message. For example ''ValueError: The end coordinate of ctg012.gene0001-1 does not correspond to the last CDS end coordinate''. Now you know where to go in your GFF3 file and fix the problem! After fixing it, rerun the tool.
+You can also optionally print all protein CDS sequences in FASTA format with the ''--print_proteins'' option:
+<code>
+validate_gene_models_in_gff3.py -g Ergobibamus_cyprinoides_CL.recurated.gff3 -f Ergobibamus_cyprinoides_CL.scaffolds.fa --print_proteins > ergo_proteins.fasta
+</code>
+<del>NOTE: Currently, the checks performed in the script assume that you haven't annotated any UTRs in your GFF3. If you have UTR (''three_prime_UTR'' or ''five_prime_UTR'') features, for example the start coordinate of the first CDS will not match the start coordinate of the gene. It will thus throw an error, even though your GFF3 is perfectly fine. Something I'll have to fix in the future.</del>
+UPDATE (Jan 2025): It should now account for UTRs
+==== Re-streamlining gene names ====
+During your curation, you may be forced to make new gene names that stray from the original format. For example, if you found that gene ''FUN_000012'' was actually two genes that were wrongly fused, you may have split them up into ''FUN_000012-1'' and ''FUN_000012-2''. This is perfectly fine, but may cause problems if you want to feed the curated GFF3 as input to for example ''funannotate annotate'' or ''funannotate update''.
+To re-streamline gene names, you can make use of the ''funannotate util gff-rename'' utility:
+<code>
+funannotate util gff-rename \
+    --gff3 Ergobibamus_curated.gff3 \
+    --fasta Ergobibamus_contigs.fasta \
+    --locus_tag PYV62 \
+    --out Ergobibamus_curated_renamed.gff3
+</code>
+Note that I have also added a new locus tag here, ''PYV62''. This was the locus tag assigned to me by the NCBI genome submission portal.