Differences

This shows you the differences between two versions of the page.

--- mapping_rnaseq_data_to_your_genome [2018/01/10 13:27] – 129.173.90.108
+++ mapping_rnaseq_data_to_your_genome [2024/10/25 15:20] (current) – [What the script does] 134.190.144.194
@@ Line 1: / Line 1: @@
+====== Mapping RNAseq data to your genome ======
+Initially written by I don't know who. Updated by Joran Martijn, December 2023. (Minor update, KW, Oct 2024)
 For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome.
 Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that
-can properly deal with spliced alignments. Tophat2 is widely used and cited  https://ccb.jhu.edu/software/tophat/index.shtml but has now been superseded by HISAT2 http://ccb.jhu.edu/software/hisat2/index.shtml
+can properly deal with spliced alignments. [[https://ccb.jhu.edu/software/tophat/index.shtml|Tophat2]] is widely used and cited. However, it has now been superseded by [[http://ccb.jhu.edu/software/hisat2/index.shtml|HISAT2]]. Its manual can be found [[http://ccb.jhu.edu/software/hisat2/manual.shtml|here]].
-Before you can map your RNAseq reads to your genome you need to build an index of the genome assembly.
+===== Mapping Illumina data =====
-In a shell script that you can qsub include the following line
+HISAT2 was developed with Illumina type reads in mind. It may thus not be the optimal mapping technology for older technologies such as Sanger and 454.
-///scratch2/software/hisat2-2.1.0/hisat2-build -f name_of_genome_assembly.fasta root_for_index//
+==== The perun shell script ====
-This will create a series (depending on the size of the genome) of files with the root name eg
+The latest Perun shell submission script template can be found on the [[https://github.com/RogerLab/gospel_of_andrew/blob/main/perun_scripts/run_hisat2_rnaseq.sh|RogerLab GitHub page]]. Here is a copy paste of what it looks like as of December 2023:
-pce_p3x.1.ht2
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 20
-pce_p3x.2.ht2
+# input
+base='/scratch3/rogerlab_databases/Fastq_files/Ergobibamus/Illumina/RNAseq/trimmomatic-0.39'
+R1=${base}/eb_rna_fw.prd.fastq.gz
+R2=${base}/eb_rna_rv.prd.fastq.gz
+U1=${base}/eb_rna_fw.unp.fastq.gz
+U2=${base}/eb_rna_rv.unp.fastq.gz
+ASSEMBLY='ergo_cyp_genome.fasta.masked'
-pce_p3x.3.ht2
+# output
+PREFIX=rnaseq_vs_masked_ergo_genome
+INDEX=masked_ergo_genome
-Once the indices have been built you can proceed with the actual mapping.
+# load hisat2 software
-There are many options available http://ccb.jhu.edu/software/hisat2/manual.shtml but for our purposes we will
+source activate hisat2
-restrict them to the bare essentials.
+# align rnaseq reads
+hisat2-build -f $ASSEMBLY $INDEX
+## q: quiet, k: how many a single read can map
+hisat2 \
+    -q --threads 20 --phred33 \
+    --max-intronlen 1000 \
+    --rna-strandness RF -k 2 \
+    -x $INDEX -1 $R1 -2 $R2 -U $U1,$U2 | \
+	samtools sort --threads 20 -o $PREFIX.sort.bam -
+samtools index -@ 20 $PREFIX.sort.bam $PREFIX.sort.bam.bai
+conda deactivate
+</code>
+To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with ''qsub run_hisat2_rnaseq.sh''.
+==== What the script does ====
+The first program that is called is ''hisat2-build'', and it builds an index of the genome assembly. The mapping algorithm invoked by ''hisat2'' in the next line relies on this index. Other mapping algorithms such as ''bwa'' and ''bowtie'' use similar strategies.
+Note that ''hisat2-build'' is called here with the ''-f'' option. I actually have no idea what that means. There is no documentation on this option, but it was originally written on this LabWiki to use that option, and it doesn't seem to break anything and I'm getting desireable results, so I'm just leaving it in.
+The indexing will create a series of files with the root name eg pce_p3x. The larger your assembly, the more files it creates.
+  * pce_p3x.1.ht2
+  * pce_p3x.2.ht2
+  * pce_p3x.3.ht2
+The ''hisat2'' options:
-A typical shell script to qsub would look like this
 <code>
-#!/bin/sh
-#$ -o logo_p3x_hisat
-#$ -cwd
-#$ -pe threaded 6
-/scratch2/software/hisat2-2.1.0/hisat2  -q -p 6 --phred33 --max-intronlen 1000 -k 2 -x pce_p3x -1 PCE_RNA_1_PairNtrim.fastq -2 PCE_RNA_2_PairNtrim.fastq -S pce_3x.sam
+# indicate that the reads are in FASTQ format
+# use -f if reads are in FASTA format
+-q
+# number of threads to use
+# equivalent to --threads 6
+-p 6
+# indicate the scoring scheme for the
+# per-base quality scores encoded in the FASTQ file
+--phred33
+# indicate that you believe that the longest real intron is 1000 bps
+# (default is 50,000)
+--max-intronlen 1000
+# indicate that your library is stranded
+# i.e. you know for each read from whether it came from
+# the positive or negative strand of the genome
+# (if its not stranded do not invoke this option)
+--rna-strandness RF
+# indicate how many distinct primary alignments
+# to search for for each read
+-k 2
+# indicate the root name
+# of the index you generated
+# with hisat2-build
+-x pce_p3x
+# indicate the file for the forward (/1) reads
+-1 PCE_RNA_1_PairNtrim.fastq
+# indicate the file for the reverse (/2) reads
+-2 PCE_RNA_2_PairNtrim.fastq
+# indicate one or more files containing unpaired reads
+-U forward_unpaired.fastq,reverse_unpaired.fastq
+# set the maximum (MX) and minimum (MN) mismatch penalties. Default: MX = 6, MN = 2.
+# NOTE: you may have to increase this to get the optimal mapping of a read across an intron!
+# I have noticed this is not always the case with the default penalty settings,
+# so it is good to check -Kelsey Williamson
+--mp MX,MN
+# set the penalty for non-canonical splice sites (non-GT/AG) - default is 12
+# so if your genome uses non-canonical splice sites, you want to set this to 0
+--pen-noncansplice <int>
 </code>
-where
+Note that in this script, the hisat2 output is directly fed into ''samtools sort'' which then generates a sorted BAM file immediately. The formation of a SAM file is thus skipped. This is beneficial because it reduces computation time, and the number of samtools commands to run. A SAM file is essentially unnecessary once you have the BAM file anyway. However, if you still wish to generate the SAM file, you can remove the pipe-into-samtools line and replace it with another option for hisat2: ''-S <output.sam>''
--q indicates that the reads are in fastq format versus -f for fasta
+==== Stranded information ====
--p 6 indicates how many threads to use
+If you used HISAT2 with ''--rna-strandness RF'' to map your RNAseq reads to your genome, all "positive" reads will have the tag ''XS:A:+'', while all "negative" reads will have the ''XS:A:-'' tag. One can separate the two types of reads from the HISAT2-generated BAM file using the following command:
---phred33 indicates that scoring scheme for the quality scores
+<code>
+samtools view -b --tag XS:+ <rnaseq.bam> > <rnaseq.pos.bam>
+samtools index <rnaseq.pos.bam> <rnaseq.pos.bam.bai>
---max-intronlen 1000 indicates that you believe that the longest real intron is 1000 bps (default is 50,000)
+samtools view -b --tag XS:- <rnaseq.bam> > <rnaseq.neg.bam>
+samtools index <rnaseq.neg.bam> <rnaseq.neg.bam.bai>
+</code>
--k 2 indicates how many distinct primary alignments to search for for each read
+If you did NOT run ''--rna-strandness RF'' but still attempt to do the separation based on the XS tag, something odd will happen: only read mappings that have spliced alignment (i.e. those reads that overlap with introns) will be extracted from the original BAM file. This can lead to volcano-like expression patterns when visualizing the RNAseq coverage in IGV.
--x pce_p3x is the index root name
+===== Mapping Sanger and 454 data =====
--1 PCE_RNA_1_PairNtrim.fastq indicates the file for the forward (/1) reads
+It may be that you have to retrieve some older Sanger and 454 sequencing data and want to map it to a reference genome. I'm not sure whether HISAT2 is really unfit for longer reads (Sanger: ~1000 bp, 454: ~500 bp), but I've had some sensible results using [[https://github.com/lh3/minimap2|Minimap2]]. Minimap2 is mostly known for mapping nanopore DNAseq to reference genomes, but it is in fact very versatile and can also work with RNAseq data with the right parameters.
--2 PCE_RNA_2_PairNtrim.fastq indicates the file for the reverse (/2) reads
+==== The perun shell script ====
--S pce_3x.sam indicates the name for the SAM output file
+The latest Perun shell submission script template can be found on the [[https://github.com/RogerLab/gospel_of_andrew/blob/main/perun_scripts/run_hisat2_rnaseq.sh|RogerLab GitHub page]]. Here is the relevant section of the script for RNAseq mapping February 2024:
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 10
+#$ -q 256G-batch
+# input
+READS='eb_454_rna.SRR3734828.fastq.gz'
+ASSEMBLY='Ergobibamus_cyprinoides_CL.scaffolds.fa'
+# parameters
+THREADS=10
+# output
+BAM='454_reads_vs_ergo.minimap2.bam'
+source activate minimap2
+# options
+# -a                    Outputs in SAM format and generates CIGAR strings
+# -x                    Presets (i.e. nanopore, pacbio, short reads etc). Presets are preconfigured sets of options
+# -t                    Number of Threads
+# define these options after setting the preset, so they overwrite the overlapping option in the preset
+# --secondary=no        If you don't want to report secondary alignments (default = yes)
+# -G<n>                 Maximum gap size in read relative to reference. Relevant with --splice or 'splice' preset (default = 200k). Essentially max expected intron size
+# --splice-flank=yes    Assumes the next base after a GT donor site is an A or G.
+#                       And that the base immediately before a AG acceptor site is a C or T.
+#                       May be better to set to =no for exotic protists?
+# if you want to generate spliced alignments
+# because you have long read transcriptome data
+minimap2 \
+    -ax splice $ASSEMBLY $READS \
+    -t $THREADS -G250 \
+    | samtools view -bS - | samtools sort -o $BAM -
+samtools index $BAM
+</code>
+To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with ''qsub run_minimap2.sh''.
+Minimap2 has a rich set of parameters that you can finely tune (see also its [[https://lh3.github.io/minimap2/minimap2.html|man]]) but honestly, most of it beyond me. Thankfully Leng Hi has created some presets, which are essentially a preconfigured set of parameters fit for a particular commonly used task. For example, the ''map-ont'' preset (for mapping DNAseq Oxford Nanopore data) uses the default values for all parameters, whereas the ''map-hifi'' (for mapping DNAseq PacBio HiFi data) uses ''-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200''.
+For mapping long read RNAseq data, we can use the preset ''splice'', which is short for the following parameter settings: ''-k15 -w5 --splice -g2k -G200k -A1 -B2 -O2,32 -E1,0 -b0 -C9 -z200 -ub --junc-bonus=9 --cap-sw-mem=0 --splice-flank=yes''. The splice preset also ensures that introns in read mappings are properly denoted with the 'N' symbol rather than the 'D' (for deletion) symbol. For example, 200N in the CIGAR string would indicate an intron of 200 bp in the aligned read. This also makes it nicer to view RNAseq alignments in IGV or Tablet.
+I think these parameters are particularly tuned for Human / Mouse genomes because ''-G200k'', as far as I understand, indicates that you can expect introns up to 200k in length, and the ''--splice-flank=yes'' option assumes that the base immediately following a **GT donor site** is an A or G, and the base immediately preceding the **AG acceptor site** is a C or T. So you may want to overwrite these settings by invoking these settings after setting the preset:
+<code>
+minimap2 -ax splice $ASSEMBLY $READS -t $THREADS -G250 --splice-flank=no
+</code>