ASSEMBLING LONG READ DATA
Documentation by Sarah Shah
When you have your porechopped reads in fastq and fasta formats, try out the following assemblers:
Programs: ABruijn (https://github.com/fenderglass/ABruijn), Canu (http://canu.readthedocs.io/en/latest/quick-start.html), smartdenovo (https://github.com/ruanjue/smartdenovo), miniasm (https://github.com/lh3/miniasm)
ABruijn
ABruijn is relatively simple to use. As its name suggests, it uses an A-Bruijn graph to find the overlaps between reads. It has a polishing step to improve the quality of the assembly.
It requires FASTA input. The final output is named polished_N.fasta, where N is 1 + the number of polishing iterations specified.
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 10
unset PYTHONPATH
export PATH=/scratch2/software/Python-2.7.13/bin:$PATH
export LD_LIBRARY_PATH=/scratch2/software/Python-2.7.13/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/scratch2/software/hdf5-1.8.18/lib:$LD_LIBRARY_PATH
export PATH=/scratch2/software/blasr/bin:$PATH
cd /path/to_your_working_directory
/scratch2/software/ABruijn-1.0/bin/abruijn /path/to_your_fasta /path/to_an_output_directory <estimated coverage> --platform nano --threads 10
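The <estimated coverage> argument in the script above has to be supplied by you. A rough way to get it, assuming you have an approximate genome size in mind, is to divide the total number of bases in your reads by that genome size. The sketch below does this with awk; the function name estimate_coverage is made up for this example and is not part of ABruijn.

```shell
#!/bin/bash
# Rough coverage estimate: total read bases / expected genome size (in bp).
# estimate_coverage is a hypothetical helper, not an ABruijn tool.
estimate_coverage() {
    local fasta="$1" genome_size="$2"
    if [ -z "$genome_size" ] || [ "$genome_size" -le 0 ]; then
        echo "usage: estimate_coverage reads.fasta genome_size_in_bp" >&2
        return 1
    fi
    awk -v g="$genome_size" '
        !/^>/ { total += length($0) }      # sum sequence lines, skip headers
        END   { printf "%d\n", total / g } # truncate to a whole number
    ' "$fasta"
}
```

For example, estimate_coverage /path/to_your_fasta 5000000 prints the estimated coverage for an expected 5 Mb genome. With a mixed eukaryote-bacteria sample the number is only a ballpark figure, since the reads come from more than one genome.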
Canu
Canu has the most complicated settings of the assemblers here. It also produces the most comprehensive set of outputs, because it goes through correction, trimming, and assembly steps. The important outputs are:
- .contigs.gfa file. You can view the graph with Bandage.
- .contigs.fasta
The following script contains settings for a eukaryote-bacteria dataset where contaminating bacterial genomes can be separated easily.
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20
export PATH=/opt/perun/jre1.8.0_121/bin:$PATH
/opt/perun/canu/canu-1.6/Linux-amd64/bin/canu \
  -p outputfileprefix -d nameofdirectorytostorealloutputs \
  maxMemory=200g \
  maxThreads=20 \
  'corMinCoverage=0' 'corOutCoverage=all' 'corMhapSensitivity=high' 'corMaxEvidenceErate=0.15' 'correctedErrorRate=0.105' 'genomeSize=5m' 'corMaxEvidenceCoverageLocal=10' 'corMaxEvidenceCoverageGlobal=10' 'oeaMemory=32' 'redMemory=32' 'batMemory=200' \
  -nanopore-raw /path/to_your_fastq_file \
  useGrid=false
# The first set of parameters increases the sensitivity and keeps as much data as possible.
# The next set limits how many other reads any given read can correct, to try to avoid mixing strains.
# Finally, some default memory is increased, but this is not strictly necessary.
# From post https://github.com/marbl/canu/issues/634
smartdenovo
Download smartdenovo to your account on Perun.
/path/to/smartdenovo/smartdenovo.pl -p prefix reads.fa > prefix.mak
make -f prefix.mak
The .utg file is the important output.
miniasm
miniasm is the simplest and the fastest of all the assemblers here. First, map the reads against themselves using minimap2:
minimap2 -x ava-ont reads.fa reads.fa | gzip -1 > reads.paf.gz
Then, use miniasm:
miniasm -f reads.fq reads.paf.gz > reads.gfa
View the .gfa file using Bandage. You can convert the .gfa file to a fasta file with:
awk '/^S/{print">"$2"\n"$3}' in.gfa | fold > out.fa
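Once you have a fasta file of contigs, a quick summary (number of contigs, total length, N50) is a handy sanity check before running anything heavier. The sketch below is one way to get it with awk and sort; the function name fasta_summary is invented for this example. It handles sequences wrapped over several lines, as produced by fold above.

```shell
#!/bin/bash
# Quick contig summary for a FASTA file: count, total length and N50.
# fasta_summary is a hypothetical helper, not part of miniasm or QUAST.
fasta_summary() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }  # new record: emit previous length
        { len += length($0) }                           # add up wrapped sequence lines
        END  { if (len > 0) print len }
    ' "$1" | sort -nr | awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "contigs=0 total=0 N50=0"; exit }
            half = total / 2
            # N50: length of the contig at which the running sum of
            # lengths (longest first) reaches half of the total
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total=%d N50=%d\n", NR, total, n50
        }
    '
}
```

Run it as fasta_summary out.fa. It is only a rough check and does not replace a proper assembly comparison.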
The Unicycler GitHub page (https://github.com/rrwick/Unicycler) has nice examples of what good, alright, and terrible graphs look like.
Do a quick BLAST search of your contigs and separate out the eukaryotic and bacterial contigs. Compare your assemblies using QUAST (http://quast.bioinf.spbau.ru/) and continue to polishing and correcting your chosen assembly.
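After the BLAST search you will have some list of contig names that you consider bacterial (or eukaryotic). Splitting the assembly by that list can be done with awk, as in the sketch below. It assumes a plain text file with one contig ID per line in the first column; the function name split_contigs and all file names are made up for this example.

```shell
#!/bin/bash
# Split a FASTA file into contigs whose ID is in a list and all the rest.
# split_contigs is a hypothetical helper; the ID list is assumed to hold
# one contig ID per line (first whitespace-separated column).
split_contigs() {
    local fasta="$1" ids="$2" out_listed="$3" out_rest="$4"
    : > "$out_listed"   # create/empty both outputs up front
    : > "$out_rest"
    awk -v listed="$out_listed" -v rest="$out_rest" '
        FILENAME == ARGV[1] { if ($1 != "") want[$1] = 1; next }  # read the ID list
        /^>/ {
            id  = substr($1, 2)                  # header up to the first space, minus ">"
            out = (id in want) ? listed : rest
        }
        out != "" { print > out }                # header and its sequence lines
    ' "$ids" "$fasta"
}
```

For example, split_contigs assembly.fasta bacterial_ids.txt bacterial.fasta eukaryotic.fasta writes the listed contigs to the first output and everything else to the second.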
