Differences

This shows you the differences between two versions of the page.

--- cleaning_of_illumina_paired_end_reads [2019/09/19 23:37] – 170.10.235.116
+++ cleaning_of_illumina_paired_end_reads [2021/12/16 15:07] (current) – 170.10.250.122
@@ Line 1: / Line 1: @@
-**Basic Usage of Trimmomatic**\\
+Updated by Dandan Zhao and D. Salas-Leiva (2020-12-15)
+**1. Quality control: FastQC**\\
+FastQC reads fastq file and produces a quality control report consisting of the different modules that are very useful for cleaning the reads:\\
+• Basic Statistics: Total Sequences, Sequences flagged as poor quality, Sequence length,%GC\\
+• Per base sequence quality \\
+• Per sequence quality scores\\
+• Per base sequence content\\
+• Per sequence GC content\\
+• Per base N content\\
+• Sequence Length Distribution\\
+• Overrepresented sequences\\
+• Adapter Content\\
+input can be fastq or bam/sam file, output is a html report and a compressed report file.\\
+<code>
+ fastqc [-o output_dir] [-f fastq|bam|sam] seqfile
+or
+ fastqc seqfile1 seqfile2 .. seqfileN
+</code>
+**2. Remove adapters: Trimmomatic**\\
 The shell script below will trim/clean up Illumina paired reads (for options see http://www.usadellab.org/cms/?page=trimmomatic).\\
-If you have your own set of adapters, let's say in the following path (///path/to/userdir/Adapters.fas//), you need to place such information after you have declared the Paired and Unpaired file names <ILLUMINACLIP:/path/to/userdir/Adapters.fas:2:30:10>. If you don't do this (because you don't know which adapters were used for the library), don't worry... you can use the script as it is, since trimmomatic will use its own Illumina provided adapters and search them (All adapters by default will be searched)\\  Do not forget to provide your directories names when __/path/to/userdir/__ appears
+If you have your own set of adapters, let's say in the following path (///path/to/userdir/Adapters.fas//), you need to place such information after you have declared the Paired and Unpaired file names <ILLUMINACLIP:/path/to/userdir/Adapters.fas:2:30:10>. If you don't do this (because you don't know which adapters were used for the library), don't worry... you can use the script as it is, since trimmomatic will use its own Illumina provided adapters and search them (All adapters by default will be searched)\\  Do not forget to provide your directories names when __/path/to/userdir/__ appears, or use cd $PWD if the shell is in the same fold containing the input files.\\
 //input can be fastq (decompressed or compressed as gz), output will be decompressed by default.\\
 always check if there is a newer update of Trimmomatic available in perun.//
-== shell script: do no leave spaces between lines
+== shell script:
-Paired End Mode:\\
 <code>
 #!/bin/bash
-#$ -S /bin/sh
+#$ -S /bin/bash
 . /etc/profile
 #$ -cwd
-#$ -pe threaded 10
+#$ -pe threaded 20
-cd /path/to/userdir
-java -jar /opt/perun/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 10 -phred33 -trimlog LOG_NAME.out READS_R1.fastq READS_R2.fastq READS_R1_PairNtrim.fq READS_R1_unPairNtrim.fq READS_R2_PairNtrim.fq READS_R2_unPairNtrim.fq HEADCROP:20 LEADING:10 TRAILING:10 SLIDINGWINDOW:10:25 MINLEN:40
+cd $PWD
+R1=/abspath/to/reads.R1.fastq.gz
+R2=/abspath/to/reads.R2.fastq.gz
+basename=my_bug
+# use either your own adapters or the comprehensive adapter list in the path below:
+adap_path=/home/dsalas/anvioWrap/TrueSeq2_NexteraSE-PE.fa
+# trimmomatic version is 0.39 (latest release)
+source activate trimmomatic
+trimmomatic PE -threads 20 -phred33 -trimlog $basename\.log $R1 $R2 $basename\.1.PT.fq $basename\.1.unPT.fq $basename\.2.PT.fq $basename\.2.unPT.fq ILLUMINACLIP:$adap_path:2:30:10 HEADCROP:15 LEADING:20 TRAILING:20 SLIDINGWINDOW:40:25 MINLEN:40
+conda deactivate
 </code>
-Single End Mode:\\
+This will perform the following:\\
+• HEADCROP: Cut the specified number (20) of bases from the start of the read. (HEADCROP:15)\\
+• LEADING: Remove bases at the start of a read if the quality is below quality 20. (LEADING:20)\\
+• TRAILING: Remove bases at the end of a read if the quality is below quality 20. (TRAILING:20)\\
+• SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. Scan the read with a 10-base wide sliding window, cutting when the average quality per base drops below 25 (SLIDINGWINDOW:10:25)\\
+• MINLEN: Drop the read if it is below a specified length (below 40) (MINLEN:40).\\
+• -phred33: Quality scores in Phred33 format. [https://en.wikipedia.org/wiki/FASTQ_format]\\
+• ILLUMINACLIP: Remove adapters <ILLUMINACLIP:/path/to/userdir/Adapters.fas:2:30:10> (/path/to/userdir/Adapters.fas, adapter file).\\
+  ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
+  *fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used. See below.
+  *seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed
+  *palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
+  *simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.
+Parameters listed here are a little strict. You can adjust them based on your demand. To test the quality of the read, use fastqc first to get a quality control report. (More details about Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic)\\
+**3. Remove host reads: bowtie2**\\
+Step1. Download host reference genome (https://www.ncbi.nlm.nih.gov/genome/)\\
+Step2. Use //bowtie2-build// to build a Bowtie index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2.\\
+== shell script:\\
 <code>
 #!/bin/bash
@@ Line 25: / Line 79: @@
 . /etc/profile
 #$ -cwd
-#$ -pe threaded 10
+cd $PWD
-cd /path/to/userdir
+/opt/perun/bin/bowtie2-build -f Host_ref_genome.fna Host_ref_genome
-java -jar /opt/perun/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads 10 -phred33 -trimlog LOG_NAME.out READS.fastq READS_Ntrim.fq HEADCROP:20 LEADING:10 TRAILING:10 SLIDINGWINDOW:10:25 MINLEN:40
 </code>
-This will perform the following:\\
+Step3. Mapping the metagenome to reference genome (host) and write paired-end reads that fail to align to output file by Bowtie 2 (for options see http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).\\
-• HEADCROP: Cut the specified number (20) of bases from the start of the read.\\
-• LEADING: Remove bases in the start of a read if the quality is below quality 10. (LEADING:10)\\
-• TRAILING: Remove bases in the end of a read if the quality is below quality 10. (TRAILING:10)\\
-• SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. Scan the read with a 10-base wide sliding window, cutting when the average quality per base drops below 25 (SLIDINGWINDOW:10:25)\\
-• MINLEN: Drop the read if it is below a specified length (below 40) (MINLEN:40).\\
-• -phred33: Quality scores in Phred33 format. [https://en.wikipedia.org/wiki/FASTQ_format]\\
-• ILLUMINACLIP: Remove adapters <ILLUMINACLIP:/path/to/userdir/Adapters.fas:2:30:10> (/path/to/userdir/Adapters.fas, adapter file). \\
+Input need decompressed fastq files. Output files:
+SRR1747060.sam (File with SAM alignment infomation )
+SRR1747060_bowtie2.1.fastq
+SRR1747060_bowtie2.2.fastq
+SRR1747060_bowtie2_un.fastq
-Parameters listed here are a little strict. You can adjust them based on your demand. To test the quality of the read, use fastqc first to get a quality control report.\\
+== shell script:\\
+<code>
+#!/bin/sh
+#$ -S /bin/sh
+. /etc/profile
+#$ -o bowtie2.log
+#$ -cwd
+#$ -pe threaded 10
+bowtie2 --threads 10 -x Host_ref_genome -1 READS_R1_PairNtrim.fq -2 READS_R2_PairNtrim.fq -U READS_R1_unPairNtrim.fq,READS_R2_unPairNtrim.fq -S SRR1747060.sam --un-conc READS_bowtie2.fastq --un READS_bowtie2_un.fastq
+</code>