

From Nanopore to Gene Prediction: a pathway

Because nanopore flow cells are expensive, it is common practice to sequence multiple samples at once. As part of the protocol, barcode sequences are added to the samples, which can then be used to sort the raw MinION data into individual bins, or folders, one per barcode, and thus one per sample.

We will do this a couple of times for accuracy’s sake.

The first tool we will use is Deepbinner. This program will take the raw data, identify the barcode, and place each read in one of thirteen folders: barcode01 through barcode12, plus unclassified, for sequences whose barcode could not be identified.

The following are REQUIRED flags:

--in_dir [the directory that your MinION data has been output to]

--out_dir [the directory where you want the program to place the folders it will sort the reads into]

Deepbinner is designed to be run in parallel with the sequencing, which is to say that deepbinner can take the output from the nanopore device and sort it in real time. While we don't usually do this in this lab, it's a feature to keep in mind.

There are a number of additional flags that can help refine the program’s behaviour to your needs.

Model presets:

--native: preset for the EXP-NBD103 read start and end models. This is the default.

--rapid: preset for the SQK-RBK004 read start model.

Models:

These are used if the presets are not being used, and can be invoked with:

-s or --start_model for a model trained on the starts of reads

and

-e or --end_model for a model trained on the ends of reads.

Largely we don’t have to worry about this in this lab.

Barcoding:

Two flags:

--scan_size [value] This flag determines how much of a read's start and end will be examined for barcode signals. Defaults to 6144.

--score_diff [value] This flag determines how large a difference between the best and second-best barcode scores is required for classification to occur. Default is 0.5.

Two-model (read start and read end) behaviour:

Three mutually exclusive flags which determine how lenient the program is in classifying a read. They are, listed here from most lenient to most stringent:

--require_either

--require_start

--require_both

The first flag will allow the program to classify a read based on the barcode call of either the start or end of the read, so long as they do not disagree.

The second flag will classify a read based on a start barcode; an end barcode is optional. This is the default behaviour.

The third flag requires the same barcode on both ends of the read in order for it to be classified.
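As an illustration, a stringency flag is simply appended to the deepbinner command; the sketch below (paths are placeholders, not real directories) shows how the most stringent option would look:

```shell
# Sketch only: require the same barcode at both the start and end of a
# read before Deepbinner will classify it. Paths are placeholders.
deepbinner-runner.py realtime \
    --in_dir /path/to/raw_minion_data \
    --out_dir /path/to/binned_reads \
    --require_both --stop
```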

Performance:

There are five flags here that govern how the program runs. You will probably have no reason to alter these from their defaults.

--batch_size [value] This is the neural network batch size. Default is 256.

--intra_op_parallelism_threads [value] TensorFlow's intra_op_parallelism_threads config option. Default is 12.

--inter_op_parallelism_threads [value] TensorFlow's inter_op_parallelism_threads config option. Default is 1.

--device_count [value] TensorFlow's device_count config option. Default is 1.

--omp_num_threads [value] The OMP_NUM_THREADS environment variable. Default is 12.

Other:

--stop automatically stops Deepbinner when it runs out of reads to process. By default, the program will continue to run until manually stopped.

-h or --help shows a help message.

Here is an example shell script for deepbinner. A copy can be found at /home/gseaton/public_scripts/

 #!/bin/bash
 #$ -S /bin/bash
 . /etc/profile
 #$ -cwd
 #$ -pe threaded 12
 
 source activate deepbinner
 /scratch2/software/Deepbinner-git-June-27-2019/deepbinner-runner.py realtime \
 --in_dir [directory with the raw MinION data] \
 --out_dir [the directory you’d like the sorted folders placed] \
 --start_model /scratch2/software/Deepbinner-git-June-27-2019/models/EXP-NBD103_read_starts \
 --end_model /scratch2/software/Deepbinner-git-June-27-2019/models/EXP-NBD103_read_ends --stop
 source deactivate

Single to multi: converting many files into a few

Nanopore technology generates a truly absurd number of data files, which can be unwieldy for you, and for programs, to handle; even listing a folder in the terminal struggles with that much data. Therefore, we will use another script to combine single fast5 files into multi-fast5 files.

 #!/bin/bash
 #$ -S /bin/bash
 . /etc/profile
 #$ -cwd
 #$ -pe threaded 4
 
 source activate ont-fast5-api
 
 single_to_multi_fast5 --input_path <path to barcode folder created by deepbinner IE: barcode03> \
 --save_path <path to the directory the output should be saved in> \
 --filename_base <the prefix for the files, ie Species_barcode03> --batch_size 4000 --recursive
 
 conda deactivate
 

Some notes:

--batch_size refers to the number of fast5 files the program will combine into one single file. 4000 is a good default to work with.

--recursive will run the conversion on the files in the directory you specify, as well as on any directories inside the directory you specified. IE /scratch3/yourname/MINION/deepbinner/barcode03/another_file_level
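To get a feel for what --batch_size does to your output, the number of multi-fast5 files produced is just the read count divided by the batch size, rounded up. A quick bash sketch (the read count here is made up):

```shell
# Hypothetical example: estimate how many multi-fast5 files a run
# will produce, given a read count and the --batch_size value.
reads=130000          # made-up number of single fast5 files
batch_size=4000
# Ceiling division in bash arithmetic:
files=$(( (reads + batch_size - 1) / batch_size ))
echo "$files"         # number of multi-fast5 files expected
```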

Additionally, there is another command, multi_to_single_fast5, that can be run using the ont-fast5-api program. As the name implies, it does the reverse process, breaking apart a single multi-fast5 file into individual fast5 files.
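A minimal sketch of that reverse conversion, mirroring the single-to-multi script above (the paths are placeholders):

```shell
# Sketch: split multi-fast5 files back into single-read fast5 files.
# Run inside the ont-fast5-api conda environment; paths are placeholders.
source activate ont-fast5-api
multi_to_single_fast5 --input_path /path/to/multi_fast5_dir \
    --save_path /path/to/single_fast5_out \
    --recursive
conda deactivate
```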

I have also created a script in /home/gseaton/public_scripts that combines both deepbinner and single-to-multi-fast5, called deepbinner-combopack. This will launch deepbinner and then combine the binned reads into multi-fast5 files.


Basecalling with Guppy

Once we've binned the reads and reduced the number of files for easier use, we now have to basecall the fast5 files to obtain usable results. Nanopore technology records the changes in electrical current across the membrane as the DNA (or RNA) strand passes through the pore, and as such, it takes some doing to decode this information into an actual sequence of bases.

There are multiple basecalling programs that can be used, but here we outline the use of Guppy, a relatively recent one.

The input for guppy will be the contents of the binned reads (example: /barcode04/). Below is a sample script that can be found at /home/gseaton/public_scripts/

 #!/bin/bash
 #$ -S /bin/bash
 . /etc/profile
 #$ -cwd
 #$ -pe threaded 40
 
 cd <the directory you want to be working in>
 
 /scratch2/software/ont-guppy-cpu-3.1.5/bin/guppy_basecaller \
 -i <input directory (ie /barcode04/)> \
 -s <output directory> \
 --flowcell FLO-MIN106 --kit SQK-LSK109 --calib_detect -q 0 --recursive \
 --num_callers 20 --cpu_threads_per_caller 2 \
 --barcode_kits "EXP-NBD103" \
 --trim_strategy dna

Other than the input and output directory, most of this probably will not need to be altered. The flags --flowcell and --kit refer to the model of flow cell used, as well as the kit used in preparing the sample. Importantly, the --num_callers and --cpu_threads_per_caller values must multiply together to give the value specified in the line #$ -pe threaded. Here, we're dedicating 2 threads to each of the 20 callers we're using.
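That thread arithmetic can be checked in the script itself before launching guppy. A small guard sketch (the variable names are our own, not guppy's):

```shell
# Guard: make sure callers x threads-per-caller matches the slot
# count requested with '#$ -pe threaded' (40 in the script above).
num_callers=20
cpu_threads_per_caller=2
requested_slots=40
total=$(( num_callers * cpu_threads_per_caller ))
if [ "$total" -ne "$requested_slots" ]; then
    echo "Thread mismatch: $total != $requested_slots" >&2
    exit 1
fi
```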

More flags can be found by running /scratch2/software/ont-guppy-cpu-3.1.5/bin/guppy_basecaller --help on the command line on perun.

One last note: every set of reads you want to basecall has to be called individually, but the calls can all be placed in the same shell script so you only have to launch it once.
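One way to keep everything in a single script is to loop over the binned barcode directories. A sketch, assuming the directory layout produced by deepbinner (the barcode list and output paths are placeholders):

```shell
# Sketch: one guppy call per barcode directory, all from one script.
# Barcode list and paths are assumptions; adjust to your run.
for bc in barcode01 barcode02 barcode03; do
    /scratch2/software/ont-guppy-cpu-3.1.5/bin/guppy_basecaller \
        -i /path/to/binned/"$bc" \
        -s /path/to/basecalled/"$bc" \
        --flowcell FLO-MIN106 --kit SQK-LSK109 --calib_detect -q 0 --recursive \
        --num_callers 20 --cpu_threads_per_caller 2 \
        --barcode_kits "EXP-NBD103" \
        --trim_strategy dna
done
```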


Trimming

The next step is to trim the barcodes from the samples. While assembly programs might catch these elements and remove them, it's usually better to remove them as a separate step apart from assembly.

Here we use Porechop, which takes fastq files and trims off the barcodes. Additionally, with the defaults outlined here, any barcode detected within a read results in that read being discarded. This is not the only option: Porechop has flags that will make the program split a read containing a mid-read barcode into two reads and keep both, but this is usually more trouble than it's worth.

  #!/bin/bash
  #$ -S /bin/bash
  #$ -cwd
  #$ -pe threaded 40
  
  source activate python36-generic
  
  porechop \
  -i <input file> \
  -o <output file name> \
  --discard_middle --threads 40 --verbosity 2
  conda deactivate
 

For the output file, it's usually a good idea to include some sort of indication that the file has been trimmed, such as species.fq > species.chop.fq.
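If you have several fastq files to trim, that .chop infix can be generated automatically with a bash parameter expansion. A sketch, assuming the files sit in the current directory:

```shell
# Sketch: trim every fastq in the current directory, writing
# species.fq -> species.chop.fq via bash's ${var%suffix} expansion.
for f in *.fq; do
    porechop -i "$f" -o "${f%.fq}.chop.fq" \
        --discard_middle --threads 40 --verbosity 2
done
```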

An example script has been provided in /home/gseaton/public_scripts/
