======From Nanopore to Gene Prediction: a pathway======

By Greg and Jon
Because nanopore flow cells are expensive, it is common practice to try to sequence multiple samples at once. As part of the protocol, barcode sequences were added to the samples, which can now be used to sort the raw MinION data into individual bins, or folders, specific to the barcode, and thus the sample they're attached to.
The following are **REQUIRED flags**:

<code>in_dir</code> [the directory that your MinION data has been outputted to]

<code>out_dir</code> [the directory you want the program to place the folders that it will sort the reads into.]
Deepbinner is designed to be run in parallel with the sequencing, which is to say that in real time, deepbinner can take the output from nanopore and sort it. While we don't usually do this in this lab, it's a feature to keep in mind.
There are a number of additional flags that can help refine the program's behaviour.
**Model presets:**

<code>native</code> preset for EXP-NBD103 read start and end models. This is the default.

<code>rapid</code> preset for SQK-RBK004 read start model.
**Models:**

These are used if the presets are not being used, and can be invoked with:

<code>s</code> or <code>start_model</code> for a model trained on the starts of reads

and

<code>e</code> or <code>end_model</code> for a model trained on the ends of reads.

Largely we don't have to worry about this in this lab.
Two flags:

<code>scan_size [value]</code> This flag determines how much of a read's start and end will be examined for barcode signals. **Defaults to 6144.**

<code>score_diff [value]</code> This flag sets the minimum difference required between the best and second best barcode scores in order for classification to occur. **Default is 0.5**
**Two model (read start and read end) behaviour:**

Three mutually exclusive flags determine how lenient the program is in classifying a read. They are, listed here from most lenient to most stringent:

<code>
--require_either
--require_start
--require_both
</code>

The first flag will allow the program to classify a read based on the barcode call of either the start or end of the read, so long as they do not disagree.

The second flag will classify a read based on a start barcode; having an end barcode is optional. **This is the default behaviour.**

The third flag requires the same barcode on both ends of the read in order for it to be classified.
There are a handful of flags here that govern how the program runs. You probably will have no reason to alter these from their default.

<code>batch_size [value]</code> This is the neural network batch size. **Default is 256**

<code>intra_op_parallelism_threads [value]</code> TensorFlow's intra_op_parallelism_threads config option. **Default is 12**

<code>inter_op_parallelism_threads [value]</code> TensorFlow's inter_op_parallelism_threads config option. **Default is 1**

<code>device_count [value]</code> TensorFlow's device_count config option. **Default is 1.**

<code>omp_num_threads [value]</code> OMP_NUM_THREADS environment variable. **Default is 12.**
**Other:**

<code>stop</code> automatically stops the program

<code>h</code> or <code>help</code> Shows a help message
Here is an example shell script for deepbinner. A copy can be found at /
<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

source activate python36-generic

# deepbinner's realtime mode, using the required flags described above;
# the bracketed paths are placeholders for your own directories.
deepbinner realtime \
    --in_dir <path to the raw MinION output> \
    --out_dir <path for the sorted barcode folders> \
    --native

conda deactivate
</code>
----
===== Single to multi: converting many files into a few =====

Nanopore technology generates a truly absurd amount of data files, which can be unwieldy to use, both in the sense of day to day work and for programs like Terminal to handle. Therefore, we will use another script to combine fast5 files into multi fast5 files. This is the default method currently; however, programs like Deepbinner work with the individual files, and most of our older data sets are still in single file format, which is important to keep in mind.

<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

source activate python36-generic

# single_to_multi_fast5 ships with the ont_fast5_api package;
# paths and batch size below are placeholders to fill in.
single_to_multi_fast5 \
    --input_path <path to the single fast5 directory> \
    --save_path <path for the multi fast5 output> \
    --batch_size 4000 \
    --recursive

conda deactivate
</code>
Some notes:

<code>batch_size [value]</code> determines how many single fast5 files are combined into each multi fast5 file.

<code>recursive</code> will run the shell on both the files in the directory you specify, as well as in any directories that are inside the directory you specified.

I have also created a script in /
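To get a feel for how many files you are dealing with before and after conversion, you can count them from the command line. The directory and file names below are mocked up purely for illustration:

```shell
# Mock up a nested directory of single-read fast5 files
# (stand-ins for real MinION output).
mkdir -p demo_fast5/barcode01 demo_fast5/barcode02
touch demo_fast5/barcode01/read_1.fast5 demo_fast5/barcode01/read_2.fast5
touch demo_fast5/barcode02/read_3.fast5

# Count every .fast5 below the top directory, however deeply nested --
# the same directories that the recursive flag would walk.
find demo_fast5 -name '*.fast5' | wc -l
```

The same one-liner works on a real run directory; just swap in its path.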
----
===== Basecalling with Guppy =====

Once we've binned the reads, and reduced the number of files for easier use, we now have to //basecall// them.

Guppy is the current basecaller from Oxford Nanopore Technologies and is under active development.

The input for guppy will be the contents of the binned reads (example: /
<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

# Flowcell and kit values are examples; use the ones from your own run.
guppy_basecaller \
    -i <path to the binned multi fast5 files> \
    -s <path for the basecalled output> \
    --flowcell FLO-MIN106 \
    --kit SQK-LSK109
</code>
Other than the input and output directory, most of this probably will not need to be altered. The flags flowcell and kit refer to the make of the flowcell used as well as the kit used in preparing the sample.
One last note: every set of reads you want to basecall will have to be called upon individually.

----
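Since each barcode bin is basecalled separately, a small loop can generate the per-bin commands for you. The bin names, flowcell, and kit values here are examples only, and the loop just prints the commands instead of running guppy:

```shell
# Mock barcode bins such as the binning step would produce.
mkdir -p bins/barcode01 bins/barcode02

# Print one guppy_basecaller command per bin; remove 'echo' to run for real.
for dir in bins/*/; do
    name=$(basename "$dir")
    echo guppy_basecaller -i "$dir" -s "called/$name" \
        --flowcell FLO-MIN106 --kit SQK-LSK109
done
```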
===== Trimming =====

The next step is to trim the barcodes from the samples. While assembly programs might catch these elements and remove them, it's usually better to remove them as a separate step apart from assembly.

Before this step, however, you should merge all the files together into one single .fastq file for each barcode directory created by guppy, using the <code>cat</code> command.

On the command line, write:

<code>cat *.fastq > <output name>.fastq</code>

This will take all files ending in the fastq suffix and merge them into a single file. You can name the output however you like. This is the file you will use for porechop.
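As a sanity check, here is the merge on a pair of toy files (the file and directory names are made up). Since a fastq record is four lines, the merged file's line count should be four times the read count:

```shell
mkdir -p guppy_out
# Two toy fastq files standing in for guppy's per-batch output
# (each record is 4 lines: header, sequence, '+', qualities).
printf '@read1\nACGT\n+\nFFFF\n' > guppy_out/batch_0.fastq
printf '@read2\nTTGA\n+\nIIII\n' > guppy_out/batch_1.fastq

# Merge every .fastq in the directory into one file for porechop.
cat guppy_out/*.fastq > guppy_out/merged.fq

wc -l < guppy_out/merged.fq   # 2 reads x 4 lines = 8 lines
```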
Here we use porechop, which takes fastq files and trims off the barcodes. Additionally, the <code>discard_middle</code> flag used below throws out any read with an adapter in its middle, since such reads are likely chimeric.
<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 40

source activate python36-generic

porechop \
    -i <input file> \
    -o <output file name> \
    --discard_middle --threads 40 --verbosity 2

conda deactivate
</code>

For the output file, it's usually a good idea to include some sort of indication that the file has been trimmed, such as species.fq > species.chop.fq.

An example script has been provided in /
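The renaming convention above is easy to script. This sketch uses plain shell parameter expansion, and the file name is just an example:

```shell
# Insert ".chop" before the extension: species.fq -> species.chop.fq
in=species.fq
out="${in%.fq}.chop.fq"
echo "$out"   # species.chop.fq
```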
===== Filtlong =====

This step is important if you have lots and lots of data. Here, filtlong attempts to take the 'best' reads from the data set, judged on read length and quality.

In our lab, we typically use filtlong to obtain the longest reads.
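Before settling on filtering thresholds, it helps to look at the read lengths you actually have. A quick awk sketch on toy data (a real run would use your merged, chopped fastq):

```shell
# Toy fastq with three reads of length 8, 4 and 2.
printf '@r1\nACGTACGT\n+\nFFFFFFFF\n' >  toy.fastq
printf '@r2\nACGT\n+\nFFFF\n'         >> toy.fastq
printf '@r3\nAC\n+\nFF\n'             >> toy.fastq

# Sequence lines are every 4th line starting at line 2;
# print each read's length, longest first.
awk 'NR % 4 == 2 { print length($0) }' toy.fastq | sort -rn
```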
An example script can be found in /
<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 2

source activate python36-generic

# Thresholds are examples; tune them to your own data set.
filtlong --min_length 1000 --keep_percent 90 <input fastq> > <output fastq>

conda deactivate
</code>
Additional commands can be found in the filtlong documentation.
Example script can be found at /
===== Assembly =====

====Flye====

There are a number of assembly tools that can be used. Here, we will use Flye, which is relatively quick, and fairly decent at assembly.

In order to use this, you should have some estimation of your genome size. The script below is a meta-genome assembly, which means that it treats the input as if it is a metagenome. This is useful to make sure mitochondrial DNA (or other plastids) are not accidentally thrown out during the assembly process.
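One way to sanity-check a genome-size estimate is against the total bases you sequenced, which also tells you the expected coverage. A toy illustration (real input would be your filtered fastq):

```shell
# Toy fastq: two reads totalling 12 bases.
printf '@r1\nACGTACGT\n+\nFFFFFFFF\n' >  reads.fastq
printf '@r2\nACGT\n+\nFFFF\n'         >> reads.fastq

# Sum the sequence-line lengths; coverage = total bases / genome size.
awk 'NR % 4 == 2 { total += length($0) } END { print total }' reads.fastq
```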
Shell can be found at /
<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 30
#$ -o leg

source activate python36-generic

# Genome size and paths are placeholders to fill in.
flye --nano-raw <path to the chopped, filtered fastq> \
    --meta \
    --genome-size <estimate> \
    --out-dir <output directory> \
    --threads 30

conda deactivate
</code>
====Canu====

Canu is another assembly program with higher fidelity at the cost of greater run time per assembly. It is specialized in assembling nanopore data and PacBio data, which makes it a good choice for our lab and work. There are three stages to Canu: correction, trimming and assembly. This means it can take the raw nanopore data and process it through most of the steps described above, should one choose. It can also restart an assembly should the worst happen and a power or program failure occur.
A basic example of the overall layout of canu looks like this (from the canu documentation):

<code>
canu \
    [-s <assembly-specifications-file>] \
    -p <assembly-prefix> \
    -d <assembly-directory> \
    genomeSize=<number>[g|m|k] \
    [other-options] \
    [<the type of data you're feeding it> -pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq
</code>
The p flag is required, as it has to do with the naming of the output files, while the d flag is not mandatory. Without it, canu will run within the current directory. However, since it is **not possible** to run two assemblies in one directory, it's best to fill this option with something unique. The s flag allows you to import parameters if you have them, which allows commonly used parameters to be replaced with ones specific to your assembly.
Parameters are written as <code>key=value</code> pairs.
By default, correction, trimming and assembly are performed, but it is possible to limit canu to only a specific task if you choose.
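If you do want to run one stage at a time, canu accepts the -correct, -trim and -assemble flags for exactly that. A sketch of the three-step form, where the prefix, directories, genome size and file names are all placeholders (the intermediate file names follow canu's <prefix>.correctedReads / <prefix>.trimmedReads convention):

<code>
canu -correct -p mygenome -d correct genomeSize=50m -nanopore-raw reads.fastq
canu -trim -p mygenome -d trim genomeSize=50m -nanopore-corrected correct/mygenome.correctedReads.fasta.gz
canu -assemble -p mygenome -d assemble genomeSize=50m -nanopore-corrected trim/mygenome.trimmedReads.fasta.gz
</code>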
Below is an example script; it can be accessed at /

<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe threaded 20

export PATH=/

canu \
    -p <prefix desired> -d <output directory> \
    maxMemory=[number + SI unit; example: 200g] \
    maxThreads=[number] \
    genomeSize=<estimate> \
    -nanopore-raw <path to the guppy-basecalled, chopped fastq file> \
    useGrid=false
</code>
A fuller list of parameters can be found in the canu documentation.