====== From Nanopore to Gene Prediction: a pathway ======

By Greg and Jon
The following are **REQUIRED flags**:

<code>--in_dir</code> [the directory that your MinION data has been output to]

<code>--out_dir</code> [the directory where you want the program to place the folders that it will sort the reads into]
Deepbinner is designed to be run in parallel with the sequencing, which is to say that in real time, deepbinner can take the output from nanopore and sort it. While we don't usually do this in this lab, it's a feature to keep in mind.
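Putting the required flags together, a minimal realtime invocation might look like the sketch below. The directory paths are placeholders and this assumes the default native preset; it is an illustration, not a script from our lab.

```shell
# Sketch only: sort reads into barcode folders as they come off the MinION.
# Both directory paths are placeholders.
deepbinner realtime \
    --in_dir /path/to/minion/output \
    --out_dir /path/to/sorted/reads \
    --native
```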
There are a number of additional flags that can help refine the program's behaviour:
**Model presets:**

<code>--native</code> preset for the EXP-NBD103 read start and end models. This is the default.

<code>--rapid</code> preset for the SQK-RBK004 read start model.
**Models:**
These are used if the presets are not being used, and can be invoked with:

<code>-s</code> or <code>--start_model</code> for a model trained on the starts of reads

and

<code>-e</code> or <code>--end_model</code> for a model trained on the ends of reads.
Largely we don't have to worry about this in this lab.
Two flags:

<code>--scan_size [value]</code> This flag determines how much of a read's start and end will be examined for barcode signals. **Defaults to 6144.**

<code>--score_diff [value]</code> This flag determines how much of a difference between the best and second-best barcode scores is allowed in order for classification to occur. **Default is 0.5.**
**Two model (read start and read end) behaviour:**

Three mutually exclusive flags determine how lenient the program is in classifying a read. They are, listed here from most lenient to most stringent:
<code>
--require_either
--require_start
--require_both
</code>
The first flag will allow the program to classify a read based on the barcode call of either the start or end of the read, so long as they do not disagree.
The second flag will classify a read based on a start barcode; having an end barcode is optional. **This is the default behaviour.**
The third flag requires the same barcode on both ends of the read in order for it to be classified.
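For example, adding the most stringent of the three to a realtime call would look like this sketch (placeholder paths; not a lab-standard script):

```shell
# --require_both trades yield for confidence: a read is only classified
# when its start and end barcodes agree. Paths are placeholders.
deepbinner realtime \
    --in_dir /path/to/minion/output \
    --out_dir /path/to/sorted/reads \
    --require_both
```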
There are several flags here that govern how the program runs. You probably will have no reason to alter these from their defaults.

<code>--batch_size [value]</code> This is the neural network batch size. **Default is 256.**

<code>--intra_op_parallelism_threads [value]</code> TensorFlow's intra_op_parallelism_threads config option. **Default is 12.**

<code>--inter_op_parallelism_threads [value]</code> TensorFlow's inter_op_parallelism_threads config option. **Default is 1.**

<code>--device_count [value]</code> TensorFlow's device_count config option. **Default is 1.**

<code>--omp_num_threads [value]</code> OMP_NUM_THREADS environment variable. **Default is 12.**
**Other:**
<code>--stop</code> stops the program automatically.

<code>-h</code> or <code>--help</code> shows a help message.
Here is an example shell script for deepbinner. A copy can be found at /
<code>
#!/bin/bash
#$ -S /bin/bash
. /
</code>
Nanopore technology generates a truly absurd number of data files, which can be unwieldy to use, both in the sense of day-to-day work and for programs like Terminal to handle. Therefore, we will use another script to combine fast5 files into multi fast5 files. This is the default method currently; however, programs like Deepbinner work with the individual files, and most of our older data sets are still in single-file format, which is important to keep in mind.
<code>
#!/bin/bash
#$ -S /bin/bash
. /

conda deactivate
</code>
Some notes:
<code>--batch_size</code> sets how many reads are grouped into each multi fast5 file.

<code>--recursive</code> will run the script on both the files in the directory you specify, as well as in any directories that are inside the directory you specified, i.e. /
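As a sketch of how those notes translate into a command, the ont_fast5_api tool single_to_multi_fast5 can be invoked roughly as below; the paths and the batch size of 4000 are placeholder assumptions, not lab-standard values.

```shell
# Placeholder paths; --batch_size controls how many reads go into each
# multi fast5 file, and --recursive descends into subdirectories.
single_to_multi_fast5 \
    --input_path /path/to/single_fast5s \
    --save_path /path/to/multi_fast5s \
    --batch_size 4000 \
    --recursive
```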
I have also created a script in /
Once we've binned the reads, and reduced the number of files for easier use, we now have to //basecall// them.
Guppy is the current basecaller from Oxford Nanopore Technologies, and is being actively developed and updated.
The input for guppy will be the contents of the binned reads (example: /
<code>
#!/bin/bash
#$ -S /bin/bash
. /
</code>
Other than the input and output directory, most of this probably will not need to be altered. The flags flowcell and kit refer to the make of the flowcell used as well as the kit used in preparing the sample.
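To illustrate those two flags, a bare-bones guppy invocation might look like the sketch below; the paths are placeholders, and the flowcell/kit codes (FLO-MIN106, SQK-LSK109) are common examples rather than what any particular run used.

```shell
# Placeholder paths and example flowcell/kit codes; substitute the codes
# printed on your flowcell and library prep kit.
guppy_basecaller \
    --input_path /path/to/binned/fast5s \
    --save_path /path/to/basecalled/output \
    --flowcell FLO-MIN106 \
    --kit SQK-LSK109
```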
The next step is to trim the barcodes from the samples. While assembly programs might catch these elements and remove them, it's usually better to remove them as a separate step apart from assembly.
Before this step, however, you will need to combine the fastq files that guppy produced into a single file.

On the command line, write:
<code>cat *.fastq > [name].fastq</code>
This will take all files ending in the fastq suffix and merge them into a single file. You can name the output however you like. This is the file you will use for porechop.
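To see that the merge behaves as described, here is a tiny self-contained demonstration with throwaway files in a temporary directory (the read names and file names are made up for the example):

```shell
# Create two fake fastq files (4 lines per read) and merge them.
tmp=$(mktemp -d)
cd "$tmp"
printf '@read1\nACGT\n+\nIIII\n' > a.fastq
printf '@read2\nTTTT\n+\nJJJJ\n' > b.fastq
cat *.fastq > merged.fq   # name the output however you like
wc -l merged.fq           # 8 lines: two reads at 4 lines each
```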
Here we use porechop, which uses fastq files and trims off the barcodes. Additionally, porechop will split any read that has an adapter in its middle.
<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

conda deactivate
</code>
For the output file, it's usually a good idea to include some sort of indication that the file has been trimmed, such as species.fq > species.chop.fq.
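A minimal porechop call following that naming convention might look like this sketch (the species file names are placeholders):

```shell
# Placeholder file names; -i is the merged fastq and -o the trimmed output.
porechop -i species.fq -o species.chop.fq
```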
An example script has been provided in /
===== Filtlong =====
This step is important if you have lots and lots of data. Here, filtlong attempts to take the 'best' reads from a large data set, judging them by length and quality.
In our lab, we typically use filtlong to obtain the longest reads.
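As a sketch, a filtlong call that favours long reads could look like the following; the length cutoff and percentage are placeholder values rather than lab-standard settings.

```shell
# Placeholder thresholds: discard reads under 1 kb, then keep the best
# 90% of the remaining bases.
filtlong --min_length 1000 --keep_percent 90 species.chop.fq > species.filt.fq
```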
An example script can be found in /
<code>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

conda deactivate
</code>
Additional commands can be found [[https://
Example script can be found at /
===== Assembly =====
==== Flye ====
There are a number of assembly tools that can be used. Here, we will use Flye, which is relatively quick, and fairly decent at assembly.
Shell can be found at /
<code>
#!/bin/bash
#$ -S /bin/bash
. /

flye --nano-raw \

conda deactivate
</code>
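Since the example script is truncated here, a typical flye command might look roughly like the sketch below; the file names, genome size, and thread count are placeholder assumptions.

```shell
# Placeholder values throughout; --out-dir will be created by flye and
# --genome-size is an estimate of the expected genome size.
flye --nano-raw species.chop.fq \
    --out-dir flye_assembly \
    --genome-size 5m \
    --threads 8
```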
| - | | + | |
Canu is another assembly program with higher fidelity at the cost of greater run time per assembly. It is specialized in assembling long, high-error-rate reads such as those from Nanopore and PacBio.
A basic example of the overall layout of canu looks like this: (from the [[https://
<code>
canu [-correct | -trim | -assemble | -trim-assemble] \
  [-s <assembly-specifications-file>] \
  -p <assembly-prefix> \
  -d <assembly-directory> \
  genomeSize=<number>[g|m|k] \
  [other-options] \
  [<these are the type of data you're feeding it> -pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq
</code>
The p flag is required, as it has to do with the naming of the output files, while the d flag is not mandatory. Without it, canu will run within the current working directory.
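Putting that layout into a concrete, hypothetical command (prefix, directory, genome size, and input file are all placeholders):

```shell
# All values are placeholders; useGrid=false keeps canu on the local
# machine instead of submitting its own cluster jobs.
canu -p species -d canu_assembly \
    genomeSize=5m \
    useGrid=false \
    -nanopore-raw species.chop.fq
```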
Parameters are written as 'option=value' pairs.
Below is an example script, which can be accessed at /
<code>
#!/bin/bash
#$ -S /bin/bash
. /

-nanopore-raw <path to the guppy-basecalled, chopped fastq file> \
useGrid=false
</code>
//Some useful parameters://
**corMinCoverage**: this governs the correction: if set to 0, the correction is non-lossy, such that the output is the same length as the input. This keeps errors, which will be removed later in the process. Default is 4.

**corOutCoverage**: this takes an integer value and controls how much coverage in reads is corrected. Default is 40.

**corMhapSensitivity**: Settings: low/normal/high. The default is chosen based on coverage.

**corMaxEvidenceErate**: Setting this limits read correction to only overlaps at or below this value (between 0 and 1). Default is unlimited.

**correctedErrorRate**: Setting is a fraction (between 0 and 1). Default depends on the technology used: 0.045 for PacBio, or 0.144 for Nanopore reads. It is recommended that for low-coverage data the default value be increased by 1%, and for high-coverage data, decreased by 1%.

**genomeSize**: This is the only required parameter. It has no default value, and is based on the size of the genome you're working with.

**corMaxEvidenceCoverageGlobal**: Limits reads used for correction to support at most this coverage. Default is 1.0 times estimated coverage.

**corMaxEvidenceCoverageLocal**: Limits reads being corrected to at most this much evidence coverage. Default is 10 times estimated coverage.

**[prefix]Memory**: the memory limit, set in gigabytes; unset by default. Prefixes are used to set the memory limits for specific tasks:
Prefix:
//oea//: error adjustment in overlaps

//red//: error detection in reads

//bat//: unitig construction
A fuller list can be found here: [[https://