Differences

This shows you the differences between two versions of the page.

--- automatically_fixing_gene_models [2024/07/02 10:51] – [Necessary Input Files] 134.190.232.164
+++ automatically_fixing_gene_models [2024/07/16 11:07] (current) – [Running fix_genes_with_false_introns.py] 134.190.232.164
@@ Line 18: / Line 18: @@
 This script makes use of python packages:
-- gffutils to read in, handle and output GFF3 files
-- pysam to read in and handle BAM files
+  * gffutils to read in, handle and output GFF3 files
-- biopython to read in and handle FASTA files
+  * pysam to read in and handle BAM files
-- orfipy to predict ORFs
+  * biopython to read in and handle FASTA files
+  * orfipy to predict ORFs
 If you are planning to run this on your local computer, perhaps the easiest way to get it to run is to install all dependencies using CONDA or MAMBA:
 <code>
-conda create -n fix_genes_script -c bioconda gffutils pysam biopython orfipy
+conda create -n fix_genes_script -c bioconda python=3.9 gffutils pysam biopython orfipy
 </code>
-If you want to run it on PERUN, you'd need to find an environment that has these dependencies installed as well. As of right now, the 'find-supported-orfs' environment works.
+If you want to run it on PERUN, you'd need to find an environment that has these dependencies installed as well. As of right now, the ''find-supported-orfs'' environment works.
+Now activate the environment.
+If you've installed it locally, run
+''conda activate fix_genes_script'' or whatever you've decided to name the environment.
+If you're on Perun, run
+''source activate find-supported-orfs''
 ==== Necessary Input Files ====
@@ Line 64: / Line 72: @@
 i.attributes['ID'][0] = sorted_exon_names[0].replace('exon','intron')
 </code>
+The PERUN environment ''find-supported-orfs'' should have 0.12, so you'll need to adjust the code.
+==== Running fix_genes_with_false_introns.py ====
+Running the script is fairly straightforward.
+If you are running it locally, make sure you have you are in the right environment, and then run
+<code>
+fix_genes_with_false_introns.py \
+    -g <your_GFF3_file_with_explicit_intron_features> \
+    -b <your_BAM_file> \
+    -f <your_FASTA_file> \
+    > <output_GFF3_file>
+</code>
+If you want to run it on Perun, you could simply login to one of the compute nodes (for example Perun21): ''ssh perun21'' and activate the ''find-supported-orfs'' environment and run it as if you were running it locally. You can also run it using a shell submission script:
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+#$ -cwd
+#$ -m bea
+#$ -M joran.martijn@dal.ca
+source activate find-supported-orfs
+fix_genes_with_false_introns.py \
+    -g <your_GFF3_file_with_explicit_intron_features> \
+    -b <your_BAM_file> \
+    -f <your_FASTA_file> \
+    > <output_GFF3_file>
+conda deactivate
+</code>
+And that's it! It's not the fastest script in the world but should take anywhere between a couple of hours and a day, depending on the size of your genome and BAM file. I'd recommend testing the script out on a smaller GFF3 file, for example one that contains features of only one contig, or a subset of a contig.
+Each newly created gene in the output GFF3 file will have in its source field ''fix_genes.py''. So, if you'd like to just see the newly created genes, you can extract them using ''awk '$2=="fix_genes.py"' <output_GFF3_file> > only_new_genes.gff3''. I'd recommend loading the original GFF3, along with the entire updated GFF3, and a GFF3 with just the new genes in IGV. This will allow you to see whether the newly updated genes actually make sense.
+Hopefully they do! If they don't, contact me (Joran) or try to see if you can update the code yourself. So far I've tested the script on Ergobibamus cyprinoides (which has a relatively simple genome structure) and Meteora (which is a bit more complicated). I already had to adjust the code quite a bit to yield sensible results with Meteora so chances are it may not work very well with another genome.
+Remember that this script will not always yield correct genes. It is very simple and looks for just the longest ORFs in a defined region (while respecting supported introns, of course). Sometimes a shorter version of an ORF actually corresponds to a gene. It's up to you to detect these and curate them by hand. This script just does a lot of heavy lifting for you, but won't be perfect.