Differences

This shows you the differences between two versions of the page.

--- ploidy_analysis_using_ploidyngs [2019/09/27 10:36] – created 129.173.90.41
+++ ploidy_analysis_using_ploidyngs [2022/06/24 16:01] (current) – [UPDATE (June 2022)] 134.190.232.29
@@ Line 5: / Line 5: @@
 Script for running a ploidyNGS analysis on perun
+<code>
 #!/bin/bash\\
 #$ -S /bin/bash\\
 . /etc/profile\\
 #$ -cwd
 source activate ploidyNGS-preq
-/scratch2/software/ploidyNGS/ploidyNGS.py -o outputfilename -b nameofsortedbamfile.bam
+/scratch2/software/ploidyNGS/ploidyNGS.py \
+    -o outputfilename \
+    -b nameofsortedbamfile.bam
 conda deactivate
+</code>
 Perun will immediately generate error messages that look dire if not fatal
+''
 "fatal: not a git repository (or any parent up to mount point /misc)
 Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set)."
-but you can ignore these and the program continues to run
+''
+This is the result of a python function defined by ploidyNGS to try to identify which version of the script you are running, but it typically doesn't work. You can safely ignore these and the program continues to run
-One of the options you will be interested in is
--d 100 (default), --max_depth 100 (default)
-    Max number of reads kepth at each position in the
-    reference genome (integer, default: 100)
-    If you have low coverage you will need to change this otherwise you won't have
-    many sites that qualify for examination.
+One of the options you will be interested in is ''-d'' or ''--max_depth'', which is set to 100 by default.
+<code>
+Max number of reads kepth at each position in the reference genome (integer, default: 100)
+</code>
+This sets the maximum read depth ''ploidyNGS.py'' will read until it moves on to the next alignment position. For example, if you have it set to a 100, but at position 123 you have actually 500 reads mapped, it will only load the basecalls of the first 100 reads to calculate allele frequencies from. I (Joran) found that it is pretty safe to increase this value to say 2000, to use as much information as possible.
 Examine the tab output file. In particular, if the lines for Third and Fourth contain
@@ Line 38: / Line 45: @@
 To see the sorted content of the Third and Fourth lines
-grep Fourth myoutput.tab |cut -f4 |sort -n\\
+<code>
-grep Third myoutput.tab |cut -f4 |sort -n
+$ grep Fourth myoutput.tab | cut -f4 | sort -n
+$ grep Third  myoutput.tab | cut -f4 | sort -n
+</code>
+To remove lines containing "Third" and "Fourth" from the .tab file:
+<code>
+$ grep -vP "Third|Fourth" myoutput.tab > mytabforhistogram.tab
+</code>
-To remove Third and Fourth lines from the tab
+Once satisfied with the tab you need to make a PDF (it is suppose to generate it automatically
+but it doesn't)
-grep -v Fourth  myoutput.tab |grep -v Third > mytabforhistogram.tab
+On the command line execute
+<code>
+$ Rscript --vanilla /scratch2/software/ploidyNGS/ploidyNGS_generateHistogram.R mytabforhistogram.tab
+</code>
-Once satisfied with the tab you need to make a pdf (it is suppose to generate it automatically
+This will generate a PDF called ''NA'' which you can transfer to your home computer to look at.
-but it doesn't)
-On the commandline
+==== UPDATE (June 2022) ====
+I (Joran) have updated the ''ploidyNGS.py'' and ''ploidyNGS_generateHistogram.R'' scripts. They are available in Perun under the new names ''ploidyNGS_minCov.py'' and ''ploidyNGS_generateHistogram_minCov.R'', respectively.
+The main new feature is the ''--min_cov'' option, which is set to 0 by default. It allows the user to ignore positions that have a lower coverage than the specified value in its allele frequency calculations. This can be useful if you have mysterious 50%/50% or 33%/67% peaks in your histogram which is otherwise strongly hinting haploidy. In my experience these mysterious peaks can stem from low coverage regions with sequencing errors. For example, if you have a position that is covered twice, and one of them is an error, it will yield a 50%/50% peak even though there is no diploidy going on here.
+Other updates:
+  * Positions reported in the .tab file are now 1-indexed (i.e. starting at 1) instead of 0-indexed (i.e. starting at 0)
+  * If you have ''/scratch2/software/ploidyNGS/'' in your ''$PATH'', the updated Python script will properly execute the histogram R script without any issues. The PDF file will have a proper file name ending in .pdf, instead of the mysterious ''NA''
+  * The axis titles in the histogram will be a bit easier to understand
+  * The script will now report the total number of heteromorphic positions per contig and over all contigs
+To call it, use a Perun submission script like this one
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -q 256G-batch
+source activate ploidyNGS-dependencies
+export PATH="/scratch2/software/ploidyNGS:$PATH"
+BAMFILE='dnaseq_vs_canu_ergo_assembly.sorted.bam'
+MINCOV=10
+MAXDEPTH=2000
+OUTBASE='dnaseq_vs_canu_ergo_assembly'
+ploidyNGS_minCov.py \
+    --out $OUTBASE \
+    --bam $BAMFILE \
+    --min_cov $MINCOV \
+    --max_depth $MAXDEPTH
+conda deactivate
+</code>
-Rscript --vanilla /scratch2/software/ploidyNGS/ploidyNGS_generateHistogram.R mytabforhistogram.tab
+A more elaborate script is available on the GitHub repository of the Roger lab: https://github.com/RogerLab/gospel_of_andrew.
-This will generate a pdf called NA which you can transfer to your home computer to look at.
+The updated scripts are also available on the original GitHub repository https://github.com/diriano/ploidyNGS under the original script names ''ploidyNGS.py'' and ''ploidyNGS_generateHistogram.R''