======From Nanopore to Gene Prediction: a pathway======
By Greg and Jon

There are a number of additional flags that can help refine the program's behavior to your needs.

Importantly, the fast5 reads will be physically transferred from the input folder to the new respective folders, so don't be alarmed when the inputs 'disappear'!

**Model presets:**
The first flag will allow the program to classify a read based on the barcode call of either the start or end of the read, so long as they do not disagree.

The second flag will classify a read based on a start barcode, and having an end barcode is optional. **This is the default behaviour.**

The third flag requires the same barcode on both ends of the read in order for it to be classified.

Nanopore technology generates a truly absurd amount of data files, which can be unwieldy both for day-to-day work and for programs like Terminal to handle. Therefore, we will use another script to combine fast5 files into multi fast5 files. This is the default method currently; however, programs like Deepbinner work with the individual files, and most of our older data sets are still in single-file format, which is important to keep in mind.

<code>#!/bin/bash
single_to_multi_fast5 --input_path <path to barcode folder created by deepbinner IE: barcode03> \
--save_path <path to the directory the output should be saved as> \
--filename_base <the prefix for the files, ie Species_barcode03> \
--batch_size 8000 --recursive

conda deactivate</code>
Some notes:

<code>--batch_size</code> refers to the number of fast5 files the program will combine together into one single file. 8000 is a good default to work with.

<code>--recursive</code> will run the command on the files in the directory you specify, as well as in any directories that are inside the directory you specified. IE /scratch3/yourname/MINION/deepbinner/barcode03/another_file_level

Additionally, there is another command, multi_to_single_fast5, that can be run using the ont-fast5-api program. As the name implies, it does the reverse process, breaking apart a single multi fast5 file into individual fast5 files. Deepbinner will do this for you if it detects a multi-fast5 file, however it is always good to know how to do it by hand as well.
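
As a sketch, the reverse command follows the same pattern as single_to_multi_fast5 (the paths here are placeholders):

<code>multi_to_single_fast5 --input_path <path to a folder of multi fast5 files> \
--save_path <path to the directory the output should be saved in> \
--recursive</code>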

I have also created a script in /home/gseaton/public_scripts that combines both deepbinner and single_to_multi_fast5, called deepbinner-combopack. This will launch deepbinner and then combine the binned reads into multi fast5 files.
Once we've binned the reads, and reduced the number of files for easier use, we now have to //basecall// the fast5 to obtain usable results. Nanopore technology records the electrical differences in the membrane as the DNA strand (or RNA) passes through the pore, and as such, it takes some doing to decode this information into an actual sequence of bases.

Guppy is the current basecaller from Oxford Nanopore Technologies, and it is still being actively developed and updated.

The input for guppy will be the contents of the binned reads (example: /barcode04/). Below is a sample script that can be found at /home/gseaton/public_scripts/
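
As an illustration only, a minimal guppy invocation might look like the sketch below; the flowcell and kit values shown are placeholders and must be replaced with the ones matching your actual run.

<code>#!/bin/bash
guppy_basecaller --input_path <path to binned fast5 reads, IE: barcode04> \
--save_path <path to the output directory> \
--flowcell FLO-MIN106 --kit SQK-LSK109 \
--recursive</code>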
The next step is to trim the barcodes from the samples. While assembly programs might catch these elements and remove them, it's usually better to remove them as a separate step apart from assembly.

Before this step, however, you should merge all the files together into one single .fastq file for each barcode directory created by guppy, using the 'cat' command.

On the command line, write:
<code>cat *.fastq > merged_reads.fq</code>
This will take all files ending in the fastq suffix and merge them into a single file. You can name the output however you like. This is the file you will use for porechop.

Here we use porechop, which takes fastq files and trims off the barcodes. Additionally, with the defaults outlined here, any barcode detected //within// a read results in that read being discarded. This is not the only option: porechop has flags that will have the program attempt to split a read containing a barcode in its middle into two reads and keep them. But this is usually more trouble than it's worth.
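
A minimal sketch of this step, assuming the merged_reads.fq file produced by the 'cat' step above (the --discard_middle flag gives the discard-on-middle-barcode behaviour described here):

<code>porechop -i merged_reads.fq -o trimmed_reads.fq --discard_middle</code>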
===== Filtlong =====

This step is important if you have lots and lots of data. Here, filtlong attempts to take the 'best' of the given data set, and creates a file containing that for later use. The 'bestness' can be determined by the researcher with different flags. For example, if one wanted to take only the most accurate reads, this program would do that for you.

In our lab, we typically use filtlong to obtain the longest reads.
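
As a sketch, a filtlong command weighted toward keeping long reads might look like the following; the length and percentage values here are illustrative placeholders, not lab defaults:

<code>filtlong --min_length 1000 --keep_percent 90 --length_weight 10 input.fastq > filtered.fastq</code>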

<code>#!/bin/bash
source activate flye

flye --nano-raw \
<input file. Should be trimmed, and if necessary, filtered using filtlong> \
--meta \
--genome-size <estimate it in the format 20m> --out-dir <out directory> --threads 30 --iterations 2

conda deactivate</code>

====Canu====
Canu is another assembly program with higher fidelity at the cost of greater run time per assembly. It is specialized in assembling nanopore data and PacBio data, which makes it a good choice for our lab and work. There are three stages to Canu: correction, trimming and assembly. This means it can take the raw nanopore data and process it through most of the steps described above, should one choose. It can also restart an assembly should the worst happen and a power or program failure occur.

A basic example of the overall layout of canu looks like this (from [[https://canu.readthedocs.io/en/latest/tutorial.html|the tutorial]]):
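
Paraphrasing the tutorial's usage summary, the general command layout is:

<code>canu \
-p <assembly-prefix> -d <assembly-directory> \
genomeSize=<number>[g|m|k] \
[parameters] \
-nanopore-raw <reads.fastq></code>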

The p flag is required, as it has to do with the naming of the output files, while the d flag is not mandatory. Without it, canu will run within the current directory. However, since it is **not possible** to run two assemblies in one directory, it's best to fill this option with something unique. The s flag allows you to import parameters if you have them, which allows commonly used parameters to be replaced with ones specific to your assembly.

Parameters are written as 'something=value', and the most common ones are maxMemory and maxThreads. genomeSize is always required, while all others can be set to default.
//**Parameters explained:**//

<code>corMinCoverage</code>: this governs the correction: if set to 0, the correction is non-lossy such that the output is the same length as the input. This keeps errors, which will be removed later in the process. Default is 4.

<code>corOutCoverage</code>: this can have an integer value or 'all'. A higher-than-your-total input coverage, or 'all', forces canu to correct and assemble all the input. Takes much longer to run. Default is 40.

<code>corMhapSensitivity</code>: Settings: low/normal/high. Governs how sensitive Mhap is, in a rough fashion. It should be set based on the read coverage: >60 = low, 60 to 30 = normal, <30 = high.

<code>corMaxEvidenceErate</code>: Setting this limits read correction to only overlaps at or below this value (between 0 and 1). Default is unlimited.

<code>correctedErrorRate</code>: Setting is a fraction (between 0 and 1). Default depends on the technology used: 0.045 for PacBio, or 0.144 for Nanopore reads. It is recommended that for low coverage data the default value be increased by 1%, and for high coverage data, decreased by 1%.

<code>genomeSize</code>: This is the only **required parameter**. It has no default value, and is based on the size of the genome you're working with.

<code>corMaxEvidenceCoverageGlobal</code>: Limits reads used for correction to support at most this coverage. Default is 1.0 times estimated coverage.

<code>corMaxEvidenceCoverageLocal</code>: Limits reads being corrected to at most this much evidence coverage. Default is 10 times estimated coverage.

<code>[prefix]Memory</code>: the memory limit, set in gigabytes; unset by default. Prefixes are used to set the memory limits for specific tasks:

Prefixes:

<code>oea</code>: error adjustment in overlaps

<code>red</code>: error detection in reads

<code>bat</code>: unitig/contig construction

A fuller list can be found here: [[https://canu.readthedocs.io/en/latest/parameter-reference.html#mhapsensitivity|parameter-reference]]