Differences

This shows you the differences between two versions of the page.

--- mapping_rnaseq_data_to_your_genome [2024/02/20 11:29] – [Mapping RNAseq data to your genome] 134.190.232.164
+++ mapping_rnaseq_data_to_your_genome [2024/10/25 15:20] (current) – [What the script does] 134.190.144.194
@@ Line 1: / Line 1: @@
 ====== Mapping RNAseq data to your genome ======
-Initially written by I don't know who. Updated by Joran Martijn, December 2023
+Initially written by I don't know who. Updated by Joran Martijn, December 2023. (Minor update, KW, Oct 2024)
@@ Line 8: / Line 8: @@
 Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that
 can properly deal with spliced alignments. [[https://ccb.jhu.edu/software/tophat/index.shtml|Tophat2]] is widely used and cited. However, it has now been superseded by [[http://ccb.jhu.edu/software/hisat2/index.shtml|HISAT2]]. Its manual can be found [[http://ccb.jhu.edu/software/hisat2/manual.shtml|here]].
+===== Mapping Illumina data =====
 HISAT2 was developed with Illumina type reads in mind. It may thus not be the optimal mapping technology for older technologies such as Sanger and 454.
-===== The perun shell script =====
+==== The perun shell script ====
 The latest Perun shell submission script template can be found on the [[https://github.com/RogerLab/gospel_of_andrew/blob/main/perun_scripts/run_hisat2_rnaseq.sh|RogerLab GitHub page]]. Here is a copy paste of what it looks like as of December 2023:
@@ Line 55: / Line 57: @@
 To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with ''qsub run_hisat2_rnaseq.sh''.
-===== What the script does =====
+==== What the script does ====
 The first program that is called is ''hisat2-build'', and it builds an index of the genome assembly. The mapping algorithm invoked by ''hisat2'' in the next line relies on this index. Other mapping algorithms such as ''bwa'' and ''bowtie'' use similar strategies.
@@ Line 111: / Line 113: @@
 # indicate one or more files containing unpaired reads
 -U forward_unpaired.fastq,reverse_unpaired.fastq
+# set the maximum (MX) and minimum (MN) mismatch penalties. Default: MX = 6, MN = 2.
+# NOTE: you may have to increase this to get the optimal mapping of a read across an intron!
+# I have noticed this is not always the case with the default penalty settings,
+# so it is good to check -Kelsey Williamson
+--mp MX,MN
+# set the penalty for non-canonical splice sites (non-GT/AG) - default is 12
+# so if your genome uses non-canonical splice sites, you want to set this to 0
+--pen-noncansplice <int>
 </code>
 Note that in this script, the hisat2 output is directly fed into ''samtools sort'' which then generates a sorted BAM file immediately. The formation of a SAM file is thus skipped. This is beneficial because it reduces computation time, and the number of samtools commands to run. A SAM file is essentially unnecessary once you have the BAM file anyway. However, if you still wish to generate the SAM file, you can remove the pipe-into-samtools line and replace it with another option for hisat2: ''-S <output.sam>''
-===== Stranded information =====
+==== Stranded information ====
 If you used HISAT2 with ''--rna-strandness RF'' to map your RNAseq reads to your genome, all "positive" reads will have the tag ''XS:A:+'', while all "negative" reads will have the ''XS:A:-'' tag. One can separate the two types of reads from the HISAT2-generated BAM file using the following command:
@@ Line 128: / Line 140: @@
 If you did NOT run ''--rna-strandness RF'' but still attempt to do the separation based on the XS tag, something odd will happen: only read mappings that have spliced alignment (i.e. those reads that overlap with introns) will be extracted from the original BAM file. This can lead to volcano-like expression patterns when visualizing the RNAseq coverage in IGV.
+===== Mapping Sanger and 454 data =====
+It may be that you have to retrieve some older Sanger and 454 sequencing data and want to map it to a reference genome. I'm not sure whether HISAT2 is really unfit for longer reads (Sanger: ~1000 bp, 454: ~500 bp), but I've had some sensible results using [[https://github.com/lh3/minimap2|Minimap2]]. Minimap2 is mostly known for mapping nanopore DNAseq to reference genomes, but it is in fact very versatile and can also work with RNAseq data with the right parameters.
+==== The perun shell script ====
+The latest Perun shell submission script template can be found on the [[https://github.com/RogerLab/gospel_of_andrew/blob/main/perun_scripts/run_hisat2_rnaseq.sh|RogerLab GitHub page]]. Here is the relevant section of the script for RNAseq mapping February 2024:
+<code>
+#!/bin/bash
+#$ -S /bin/bash
+. /etc/profile
+#$ -cwd
+#$ -pe threaded 10
+#$ -q 256G-batch
+# input
+READS='eb_454_rna.SRR3734828.fastq.gz'
+ASSEMBLY='Ergobibamus_cyprinoides_CL.scaffolds.fa'
+# parameters
+THREADS=10
+# output
+BAM='454_reads_vs_ergo.minimap2.bam'
+source activate minimap2
+# options
+# -a                    Outputs in SAM format and generates CIGAR strings
+# -x                    Presets (i.e. nanopore, pacbio, short reads etc). Presets are preconfigured sets of options
+# -t                    Number of Threads
+# define these options after setting the preset, so they overwrite the overlapping option in the preset
+# --secondary=no        If you don't want to report secondary alignments (default = yes)
+# -G<n>                 Maximum gap size in read relative to reference. Relevant with --splice or 'splice' preset (default = 200k). Essentially max expected intron size
+# --splice-flank=yes    Assumes the next base after a GT donor site is an A or G.
+#                       And that the base immediately before a AG acceptor site is a C or T.
+#                       May be better to set to =no for exotic protists?
+# if you want to generate spliced alignments
+# because you have long read transcriptome data
+minimap2 \
+    -ax splice $ASSEMBLY $READS \
+    -t $THREADS -G250 \
+    | samtools view -bS - | samtools sort -o $BAM -
+samtools index $BAM
+</code>
+To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with ''qsub run_minimap2.sh''.
+Minimap2 has a rich set of parameters that you can finely tune (see also its [[https://lh3.github.io/minimap2/minimap2.html|man]]) but honestly, most of it beyond me. Thankfully Leng Hi has created some presets, which are essentially a preconfigured set of parameters fit for a particular commonly used task. For example, the ''map-ont'' preset (for mapping DNAseq Oxford Nanopore data) uses the default values for all parameters, whereas the ''map-hifi'' (for mapping DNAseq PacBio HiFi data) uses ''-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200''.
+For mapping long read RNAseq data, we can use the preset ''splice'', which is short for the following parameter settings: ''-k15 -w5 --splice -g2k -G200k -A1 -B2 -O2,32 -E1,0 -b0 -C9 -z200 -ub --junc-bonus=9 --cap-sw-mem=0 --splice-flank=yes''. The splice preset also ensures that introns in read mappings are properly denoted with the 'N' symbol rather than the 'D' (for deletion) symbol. For example, 200N in the CIGAR string would indicate an intron of 200 bp in the aligned read. This also makes it nicer to view RNAseq alignments in IGV or Tablet.
+I think these parameters are particularly tuned for Human / Mouse genomes because ''-G200k'', as far as I understand, indicates that you can expect introns up to 200k in length, and the ''--splice-flank=yes'' option assumes that the base immediately following a **GT donor site** is an A or G, and the base immediately preceding the **AG acceptor site** is a C or T. So you may want to overwrite these settings by invoking these settings after setting the preset:
+<code>
+minimap2 -ax splice $ASSEMBLY $READS -t $THREADS -G250 --splice-flank=no
+</code>