Differences

This shows you the differences between two versions of the page.

--- nanopore_tools_for_polishing [2018/01/12 14:41] – 129.173.88.84
+++ nanopore_tools_for_polishing [2024/08/07 13:01] (current) – 134.190.232.164
@@ Line 1: / Line 1: @@
 ====== Polishing your MinION assembly ======
-Documentation by Jon Jerlstrom Hultqvist and Shelby Williams
+Documentation by Jon Jerlström Hultqvist and Shelby Williams
+(updates by Joran Martijn)
+**Be aware that some scripts and commands might not be working any longer on Perun due to the switch to the new conda-environment system. Sections will be progressively updated to reflect this.**
 The following are tools for polishing Oxford Nanopore assemblies, to help eliminate some of the error associated with the long-read data. Each polisher uses a mapping step prior to correcting the assemblies. BWA MEM is used here as the mapping tool, but can be replaced with another mapper.
@@ Line 17: / Line 20: @@
 {{ :bman_racon_1sh.doc |}}
-Formatting for this script:
+The following is a script using default settings:
 <code>
 minimap -t8 assembly.fasta /scratch2/user/path/to/raw_reads.fastq | racon /scratch2/user/path/to/raw_reads.fastq - assembly.fasta output_assembly.polish1.fasta
 </code>
-Note: for each polish performed, the script must be changed to reflect the new file names.
+**Note 1**: for each polish performed, the script must be changed to reflect the new file names.
+For genomes with lots of stubborn indels that persist after Nanopolishing and Unicycling, play around with the flags -e (error threshold), -w (window size), and -q (coverage threshold). I found that increasing window size and decreasing coverage threshold works well. The following is a script that contains one round of minimap2 mapping (read **Note 2** on interleaved short reads), Racon polishing, then Bowtie2 mapping (for viewing bam files):
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 8
+cd /your_working_directory/
+input=your_input_genome.fasta
+output=your_output_racon_round1.fasta
+minimap2 -t 8 $input interleavedshortreads.fq > temporary.paf
+echo "minimap2 done"
+racon -u -e 0.1 -w 500 -q 1 -t 8 interleavedshortreads.fq temporary.paf $input >$output
+echo "racon done"
+rm temporary.paf
+source activate bowtie2
+bowtie2-build $output $output
+bowtie2 -k 2 --threads 8 -x $output --very-sensitive \
+-1 your_forward_short_reads.fq \
+-2 your_reverse_short_reads.fq \
+-S $output.sam
+echo "Bowtie done"
+conda deactivate
+samtools view -F 4 -bS $output.sam |samtools sort > $output.sorted.bam
+samtools index $output.sorted.bam > $output.sorted.bam.bai
+rm $output.sam
+rm $output.*.bt2
+</code>
+You can modify the above script to loop for multiple rounds, but if you're looking for the best settings for correcting errors always check the bam files by eye after each round.
+**Note 2**: The "**interleavedshortreads.fq**" must not contain blanks/white spaces in headers. You can simply concatenate forward and reverse reads into one read file, but making them interleaved is useful for other read correction tools. To obtain interleaved reads:
+<code>
+source activate fastx_toolkit
+fastq-interleave read1.fq read2.fq > interleaved.fq
+</code>
+**WARNING**: Keep an eye on genome/contig size difference after each round of polishing. Using window size (-w) above 5000 may remove repetitive regions or regions with no short-read coverage.
 ----
@@ Line 42: / Line 82: @@
 First, make a BWA index of the assembly you wish to map onto by using the following command:
 <code>
 bwa index assembly_to_polish.fasta
 </code>
 Next, use the meteora_bwa.sh script to map the short reads onto your assembly. This will create a sorted.bam file. In this example, two paired-end read files will be mapped:
 <code>
-bwa mem -t 16 assembly_to_polish.fasta /scratch2/user/path/to/trimmedreads_1_PairNtrim.fastq.gz /scratch2/user/path/to/trimmedreads_2_PairNtrim.fastq.gz | samtools view -Sb - | samtools sort >  piloninput.sorted.bam
+bwa mem \
+    -t 16 \
+    assembly_to_polish.fasta \
+    /scratch2/user/path/to/trimmedreads_1_PairNtrim.fastq.gz \
+    /scratch2/user/path/to/trimmedreads_2_PairNtrim.fastq.gz | \
+        samtools sort --threads 16 -o piloninput.sorted.bam
 </code>
+UPDATE: You can now run bwa-mem2, which is an optimized version of bwa mem. It generates the exact same output, but is 2-4x faster:
+<code>
+bwa-mem2 mem \
+    -t 16 \
+    assembly_to_polish.fasta \
+    /scratch2/user/path/to/trimmedreads_1_PairNtrim.fastq.gz \
+    /scratch2/user/path/to/trimmedreads_2_PairNtrim.fastq.gz | \
+        samtools sort --threads 16 -o piloninput.sorted.bam
+</code>
 Once this is finished, use Pilon.sh to make changes in the assembly and generate a new consensus sequence. Pilon.sh can be formatted like so:
 <code>
-java -Xmx16G -jar /scratch2/software/pilon/pilon-1.22.jar --genome assembly_to_polish.fasta --frags piloninput.sorted.bam --output P2x --outdir Pilon2x --threads 16
+java -Xmx16G -jar /scratch2/software/pilon/pilon-1.22.jar \
+    --genome assembly_to_polish.fasta \
+    --frags piloninput.sorted.bam \
+    --output P2x \
+    --outdir Pilon2x \
+    --threads 16
 </code>
+UPDATE: The --threads option is as of v1.24 no longer maintained. It seems Pilon doesn't use more than 200-300% CPU (i.e. 3 threads) at most, so setting --threads to 4 orso should be sufficient.
 You may run into an error where Pilon does not recognize the bam file created from the previous step as being indexed. To fix this, run:
 <code>
 samtools index /path/to_bam_file
 </code>
 This will return a .bam.bai file. This file needs to be in the same folder as Pilon.sh, but does not need to be placed in the script.
@@ Line 63: / Line 134: @@
 Shell script:
 {{ :unicyclersh.docx |}}
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 16
+#cd /scratch2/jon/MinION/BMAN/assemblies/Unicycler_polish/
+echo "Starting"
+unset PYTHONPATH
+export PATH=/scratch2/software/gcc-6.3.0/bin:/scratch2/software/Python-3.6.0/bin:$PATH
+export LD_LIBRARY_PATH=/scratch2/software/gcc-6.3.0/lib64:/scratch2/software/Python-3.6.0/lib:$LD_LIBRARY_PATH
+/scratch2/software/Python-3.6.0/bin/unicycler_polish -1 /scratch2/shelbyw/RCL_Unicycler/RCL_1_PairNtrim.fq -2 /scratch2/shelbyw/RCL_Unicycler/RCL_2_PairNtrim.fq --long_reads RCL_MinION.CutAdapt75.3000.chop.fastq.gz -a RCL_unclean_AB_assembly_fix_Racon2_Pilon3.fasta --pilon=/scratch2/software/pilon/pilon-1.22.jar --samtools=/opt/perun/bin/samtools --threads 16
+echo "Done!"
+</code>
 Formatting:
@@ Line 100: / Line 193: @@
 If illumina reads are available it might be possible to skip nanopolish altogether and go directly to Pilon polishing after Racon. This has been exemplified in the Solanum penellii pre-print where nanopolish simply was not feasible.
-Location: /scratch2/software/nanopolish
+Location: /scratch2/software/anaconda/envs/nanopolish-0.12/bin
 Scripts:
@@ Line 119: / Line 212: @@
 nanopolish merge - merges the pieces into new a new consensus.
-**Updated Nanopolish protocol (as of Dec 1st 2017):**
+**Updated Nanopolish protocol (as of March 8 2020):**
-First, index your unchopped, raw reads file:
+First, index your unchopped, raw reads file.
+Use the sequencing_summary.txt produced by albacore during basecalling to speed up this step significantly. If you have several sequencing_summary.txt files these can be placed in a fof-file with the  path to the txt-file and called by -f. This also works in case of a single-file.:
+for the following step **DO NOT** **use more than 1** thread because the program is not threaded!
 <code>
-nanopolish index -d /path/to/raw_fast5s/ albacore_output.fastq
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 1
+cd $PWD
+fast5path=/scratch2/path2/fast5/
+fastq=/path2fastqlongreads.fastq
+seqsummary=/path2tosequencing_summary.txt
+source activate nanopolish-0.13.2
+export PATH=/scratch2/software/anaconda/envs/nanopolish-0.13.2/bin:$PATH
+nanopolish index -d $fast5path -s $seqsummary $fastq
+conda deactivate
 </code>
 This will create four files with the following extensions: .index, .index.gzi, .index.fai, .index.readdb .
-Next, index your draft genome (which has already been polished with Racon once or twice). You do not need to do this if you're using **minimap2** to create the alignment .bam file. I recommend splitting your draft genome into 5Mb pieces before doing this. **pyfasta split** is a handy tool for splitting. Either through the command line or a shell script, do:
+Next, map long-reads to the reference to be polished. It is highly recommended to use minimap2 for this purpose. It is quick, accurate and computationally inexpensive. In this example, the minimap2-output is piped to samtools directly to create a sorted bam-file. Finally this bam-file needs to be indexed to be useful to nanopolish consensus.
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 20
+cd $PWD
+export PATH=/scratch2/software/anaconda/bin:$PATH
+source activate python27-generic
+minimap2 -ax map-ont -t 20 \
+reference.fasta /path/to/reads.fastq \
+| samtools sort -o reads.sorted.bam -T reads.tmp -
+samtools index reads.sorted.bam
+conda deactivate
+</code>
+<del>(Next, index your draft genome (which has already been polished with Racon once or twice). You do not need to do this if you're using **minimap2** to create the alignment .bam file. I recommend splitting your draft genome into 5Mb pieces before doing this. **pyfasta split** is a handy tool for splitting. Either through the command line or a shell script, do:
 <code>bwa index piece_of_draft_genome.fa </code>
 Then alignment file .bam must be made:
 <code>bwa mem -x ont2d -t 8 draft.fa reads.fastq | samtools sort -o reads.sorted.bam -T reads.tmp
-samtools index reads.sorted.bam </code> "reads" must correspond to the name of the raw fastq and its nanoplish index file names. The "draft.fa" must also correspond to the split piece of the draft genome you bwa indexed.
+samtools index reads.sorted.bam </code> "reads" must correspond to the name of the raw fastq and its nanoplish index file names. The "draft.fa" must also correspond to the split piece of the draft genome you bwa indexed.)</del>
-Finally, build the actual consensus sequence:
+Finally, build the actual consensus sequence. nanopolish consensus will segment the genome into chunks of 50 kb and in this case run 20 jobs in parallell. BE AWARE: Nanopolish is resource hungry and it can easily swallow the resources of a node. If you end up with crashes, I would suggest throttling back the threads used by nanopolish or use a larger computing node. Each genome and dataset behaves slightly differently so it is recommended to keep a strict eye on the resources being used to nanopolish.:
 <code>
-python nanopolish_makerange.py draft.fa | parallel --results nanopolish.results -P 8 \
+#!/bin/bash
-    nanopolish variants --consensus polished.{1}.fa -w {1} -r reads.fa -b reads.sorted.bam -g draft.fa -t 4 --min-candidate-frequency 0.1
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 20
+cd $PWD
+export PATH=/scratch2/software/anaconda/envs/nanopolish-0.12/bin:$PATH
+source activate nanopolish-0.12
+nanopolish_makerange.py reference.fasta | parallel --results nanopolish.results -P 20 \
+nanopolish variants --consensus -o polished.{1}.fa -w {1} \
+-r /path/to/reads.fastq -b reads.sorted.bam -g reference.fasta -t 10 --min-candidate-frequency 0.1
+conda deactivate
 </code>
 This will create new pieces of consensus sequences in the current working directory (not in the nanopolish.results folder indicated in the script above)  named polished.*.fa. Merge them by:
 <code>
-python nanopolish_merge.py polished.*.fa > polished_genome.fa </code>
+python /scratch2/software/anaconda/envs/nanopolish-python3/bin/nanopolish_merge.py polished.*.fa > polished_reference.fa </code>
-Since you'll have several pieces of polished_genome.fa because you split the original draft genome, just **cat** them all into one file in the end.
+Sometimes, nanopolish will fail on a particular segment of fasta-file. Nanopolish merge will then indicate what segment(s) it failed to find.
+<code>
+error: segment starting at Seq_1:150000 is missing
+error: segment starting at Seq_1:900000 is missing
+error: some segments are missing, could not merge contig Seq_1
+</code>
+It is possible to separate out the offending fasta file and re-run this fasta file entry (Seq_1) through the minimap2-mapping and consensus steps to obtain the missing piece(s). These pieces can then be placed in the original nanopolish directory. But make sure that the file carries the same name as in the original multi-fasta file to make merging possible.
 Now, move on to Pilon or Unicycle!