automatically_fixing_gene_models
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| automatically_fixing_gene_models [2024/07/02 10:25] – [Improving gene models with fix_genes_with_false_introns.py] 134.190.232.164 | automatically_fixing_gene_models [2024/07/16 11:07] (current) – [Running fix_genes_with_false_introns.py] 134.190.232.164 | ||
|---|---|---|---|
| Line 15: | Line 15: | ||
| The main aim of this script is thus NOT to be the final output of your gene prediction pipeline, but rather to shave off a large amount of time from your eventual manual curation of the gene models. | The main aim of this script is thus NOT to be the final output of your gene prediction pipeline, but rather to shave off a large amount of time from your eventual manual curation of the gene models. | ||
| - | ==== Example | + | ==== Installation |
| + | |||
| + | This script makes use of python packages: | ||
| + | |||
| + | * gffutils to read in, handle and output GFF3 files | ||
| + | * pysam to read in and handle BAM files | ||
| + | * biopython to read in and handle FASTA files | ||
| + | * orfipy to predict ORFs | ||
| + | |||
| + | If you are planning to run this on your local computer, perhaps the easiest way to get it to run is to install all dependencies using CONDA or MAMBA: | ||
| + | |||
| + | < | ||
| + | conda create -n fix_genes_script -c bioconda python=3.9 gffutils pysam biopython orfipy | ||
| + | </ | ||
| + | |||
| + | If you want to run it on PERUN, you'd need to find an environment that has these dependencies installed as well. As of right now, the '' | ||
| + | |||
| + | Now activate the environment. | ||
| + | |||
| + | If you've installed it locally, run | ||
| + | '' | ||
| + | |||
| + | If you're on Perun, run | ||
| + | '' | ||
| + | ==== Necessary Input Files ==== | ||
| + | |||
| + | You will need | ||
| + | |||
| + | * The GFF3 file containing all the gene models you wish to check | ||
| + | * The FASTA file containing the DNA sequences of the genome. Ensure that the FASTA headers / Sequence names match exactly those specified in the GFF3 file | ||
| + | * The BAM and BAM index file (typically .bam and .bam.bai files) that contain the mapped RNAseq reads. Also here ensure that the reference sequence names match those in the GFF3 and FASTA files. The indexed BAM file is not fed directly to the script, but the script will assume it to exist and to be in the same directory as the BAM file. | ||
| + | |||
| + | ==== Adding intron features to the GFF3 ==== | ||
| + | |||
| + | Often gene predictors like GeneMark, Braker2 and Funannotate will not explicitly report introns as distinct GFF3 features. They are instead implied. For example, if a gene feature is associated with at least two exon features, the regions in between the exon features are implied to be intron features. | ||
| + | |||
| + | My script however needs intron features to be explicitly written in the GFF3 file. So, if your file has implied but not explicit intron features, run this script first (also available on the [[https:// | ||
| + | |||
| + | < | ||
| + | add_intron_features.py -g < | ||
| + | </ | ||
| + | |||
| + | NOTE: the '' | ||
| + | |||
| + | < | ||
| + | conda list | grep gffutils | ||
| + | </ | ||
| + | |||
| + | If it is 0.11 or lower, you don't need to do anything. If it is 0.12 or higher, comment out these lines | ||
| + | |||
| + | < | ||
| + | sorted_exon_names = sorted( i.attributes[' | ||
| + | |||
| + | i.attributes[' | ||
| + | |||
| + | i.attributes[' | ||
| + | </ | ||
| + | |||
| + | The PERUN environment '' | ||
| + | |||
| + | ==== Running fix_genes_with_false_introns.py ==== | ||
| + | |||
| + | Running the script is fairly straightforward. | ||
| + | |||
| + | If you are running it locally, make sure you have you are in the right environment, | ||
| + | |||
| + | < | ||
| + | fix_genes_with_false_introns.py \ | ||
| + | -g < | ||
| + | -b < | ||
| + | -f < | ||
| + | > < | ||
| + | </ | ||
| + | |||
| + | If you want to run it on Perun, you could simply login to one of the compute nodes (for example Perun21): '' | ||
| + | |||
| + | < | ||
| + | # | ||
| + | |||
| + | #$ -S /bin/bash | ||
| + | #$ -cwd | ||
| + | #$ -m bea | ||
| + | #$ -M joran.martijn@dal.ca | ||
| + | |||
| + | source activate find-supported-orfs | ||
| + | |||
| + | fix_genes_with_false_introns.py \ | ||
| + | -g < | ||
| + | -b < | ||
| + | -f < | ||
| + | > < | ||
| + | |||
| + | conda deactivate | ||
| + | |||
| + | </ | ||
| + | |||
| + | And that's it! It's not the fastest script in the world but should take anywhere between a couple of hours and a day, depending on the size of your genome and BAM file. I'd recommend testing the script out on a smaller GFF3 file, for example one that contains features of only one contig, or a subset of a contig. | ||
| + | |||
| + | Each newly created gene in the output GFF3 file will have in its source field '' | ||
| + | |||
| + | Hopefully they do! If they don't, contact me (Joran) or try to see if you can update the code yourself. So far I've tested the script on Ergobibamus cyprinoides (which has a relatively simple genome structure) and Meteora (which is a bit more complicated). I already had to adjust the code quite a bit to yield sensible results with Meteora so chances are it may not work very well with another genome. | ||
| + | |||
| + | Remember that this script will not always yield correct genes. It is very simple and looks for just the longest ORFs in a defined region (while respecting supported introns, of course). Sometimes a shorter version of an ORF actually corresponds to a gene. It's up to you to detect these and curate them by hand. This script just does a lot of heavy lifting for you, but won't be perfect. | ||
automatically_fixing_gene_models.1719926737.txt.gz · Last modified: by 134.190.232.164
