====== Mapping RNAseq data to your genome ======

Initially written by I don't know who. Updated by Joran Martijn, December 2023. (Minor update, KW, Oct 2024)


For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome.

Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that
can properly deal with spliced alignments. [[https://ccb.jhu.edu/software/tophat/index.shtml|Tophat2]] is widely used and cited. However, it has now been superseded by [[http://ccb.jhu.edu/software/hisat2/index.shtml|HISAT2]]. Its manual can be found [[http://ccb.jhu.edu/software/hisat2/manual.shtml|here]].

===== Mapping Illumina data =====

HISAT2 was developed with Illumina type reads in mind. It may thus not be the optimal mapping technology for older technologies such as Sanger and 454.

==== The perun shell script ====

The latest Perun shell submission script template can be found on the [[https://github.com/RogerLab/gospel_of_andrew/blob/main/perun_scripts/run_hisat2_rnaseq.sh|RogerLab GitHub page]]. Here is a copy paste of what it looks like as of December 2023:

<code>
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20

# input
base='/scratch3/rogerlab_databases/Fastq_files/Ergobibamus/Illumina/RNAseq/trimmomatic-0.39'
R1=${base}/eb_rna_fw.prd.fastq.gz
R2=${base}/eb_rna_rv.prd.fastq.gz
U1=${base}/eb_rna_fw.unp.fastq.gz
U2=${base}/eb_rna_rv.unp.fastq.gz
ASSEMBLY='ergo_cyp_genome.fasta.masked'

# output
PREFIX=rnaseq_vs_masked_ergo_genome
INDEX=masked_ergo_genome

# load hisat2 software
source activate hisat2

# align rnaseq reads
hisat2-build -f $ASSEMBLY $INDEX

## q: quiet, k: how many a single read can map
hisat2 \
    -q --threads 20 --phred33 \
    --max-intronlen 1000 \
    --rna-strandness RF -k 2 \
    -x $INDEX -1 $R1 -2 $R2 -U $U1,$U2 | \
	samtools sort --threads 20 -o $PREFIX.sort.bam -

samtools index -@ 20 $PREFIX.sort.bam $PREFIX.sort.bam.bai

conda deactivate
</code>

To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with ''qsub run_hisat2_rnaseq.sh''.

==== What the script does ====

The first program that is called is ''hisat2-build'', and it builds an index of the genome assembly. The mapping algorithm invoked by ''hisat2'' in the next line relies on this index. Other mapping algorithms such as ''bwa'' and ''bowtie'' use similar strategies.

Note that ''hisat2-build'' is called here with the ''-f'' option. I actually have no idea what that means. There is no documentation on this option, but it was originally written on this LabWiki to use that option, and it doesn't seem to break anything and I'm getting desireable results, so I'm just leaving it in.

The indexing will create a series of files with the root name eg pce_p3x. The larger your assembly, the more files it creates.

  * pce_p3x.1.ht2
  * pce_p3x.2.ht2
  * pce_p3x.3.ht2


The ''hisat2'' options:

<code>

# indicate that the reads are in FASTQ format
# use -f if reads are in FASTA format
-q 

# number of threads to use
# equivalent to --threads 6
-p 6 

# indicate the scoring scheme for the 
# per-base quality scores encoded in the FASTQ file
--phred33 

# indicate that you believe that the longest real intron is 1000 bps 
# (default is 50,000)
--max-intronlen 1000

# indicate that your library is stranded
# i.e. you know for each read from whether it came from 
# the positive or negative strand of the genome
# (if its not stranded do not invoke this option)
--rna-strandness RF

# indicate how many distinct primary alignments 
# to search for for each read
-k 2 

# indicate the root name
# of the index you generated
# with hisat2-build
-x pce_p3x 

# indicate the file for the forward (/1) reads
-1 PCE_RNA_1_PairNtrim.fastq 

# indicate the file for the reverse (/2) reads
-2 PCE_RNA_2_PairNtrim.fastq

# indicate one or more files containing unpaired reads
-U forward_unpaired.fastq,reverse_unpaired.fastq

# set the maximum (MX) and minimum (MN) mismatch penalties. Default: MX = 6, MN = 2.
# NOTE: you may have to increase this to get the optimal mapping of a read across an intron! 
# I have noticed this is not always the case with the default penalty settings, 
# so it is good to check -Kelsey Williamson
--mp MX,MN

# set the penalty for non-canonical splice sites (non-GT/AG) - default is 12
# so if your genome uses non-canonical splice sites, you want to set this to 0
--pen-noncansplice <int>
</code>

Note that in this script, the hisat2 output is directly fed into ''samtools sort'' which then generates a sorted BAM file immediately. The formation of a SAM file is thus skipped. This is beneficial because it reduces computation time, and the number of samtools commands to run. A SAM file is essentially unnecessary once you have the BAM file anyway. However, if you still wish to generate the SAM file, you can remove the pipe-into-samtools line and replace it with another option for hisat2: ''-S <output.sam>''

==== Stranded information ====

If you used HISAT2 with ''--rna-strandness RF'' to map your RNAseq reads to your genome, all "positive" reads will have the tag ''XS:A:+'', while all "negative" reads will have the ''XS:A:-'' tag. One can separate the two types of reads from the HISAT2-generated BAM file using the following command:

<code>
samtools view -b --tag XS:+ <rnaseq.bam> > <rnaseq.pos.bam>
samtools index <rnaseq.pos.bam> <rnaseq.pos.bam.bai>

samtools view -b --tag XS:- <rnaseq.bam> > <rnaseq.neg.bam>
samtools index <rnaseq.neg.bam> <rnaseq.neg.bam.bai>
</code>

If you did NOT run ''--rna-strandness RF'' but still attempt to do the separation based on the XS tag, something odd will happen: only read mappings that have spliced alignment (i.e. those reads that overlap with introns) will be extracted from the original BAM file. This can lead to volcano-like expression patterns when visualizing the RNAseq coverage in IGV.

===== Mapping Sanger and 454 data =====

It may be that you have to retrieve some older Sanger and 454 sequencing data and want to map it to a reference genome. I'm not sure whether HISAT2 is really unfit for longer reads (Sanger: ~1000 bp, 454: ~500 bp), but I've had some sensible results using [[https://github.com/lh3/minimap2|Minimap2]]. Minimap2 is mostly known for mapping nanopore DNAseq to reference genomes, but it is in fact very versatile and can also work with RNAseq data with the right parameters.

==== The perun shell script ====

The latest Perun shell submission script template can be found on the [[https://github.com/RogerLab/gospel_of_andrew/blob/main/perun_scripts/run_hisat2_rnaseq.sh|RogerLab GitHub page]]. Here is the relevant section of the script for RNAseq mapping February 2024:

<code>
#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 10
#$ -q 256G-batch

# input
READS='eb_454_rna.SRR3734828.fastq.gz'
ASSEMBLY='Ergobibamus_cyprinoides_CL.scaffolds.fa'

# parameters
THREADS=10

# output
BAM='454_reads_vs_ergo.minimap2.bam'

source activate minimap2

# options
# -a                    Outputs in SAM format and generates CIGAR strings
# -x                    Presets (i.e. nanopore, pacbio, short reads etc). Presets are preconfigured sets of options
# -t                    Number of Threads

# define these options after setting the preset, so they overwrite the overlapping option in the preset
# --secondary=no        If you don't want to report secondary alignments (default = yes)
# -G<n>                 Maximum gap size in read relative to reference. Relevant with --splice or 'splice' preset (default = 200k). Essentially max expected intron size
# --splice-flank=yes    Assumes the next base after a GT donor site is an A or G.
#                       And that the base immediately before a AG acceptor site is a C or T.
#                       May be better to set to =no for exotic protists?

# if you want to generate spliced alignments
# because you have long read transcriptome data
minimap2 \
    -ax splice $ASSEMBLY $READS \
    -t $THREADS -G250 \
    | samtools view -bS - | samtools sort -o $BAM -

samtools index $BAM
</code>

To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with ''qsub run_minimap2.sh''. 

Minimap2 has a rich set of parameters that you can finely tune (see also its [[https://lh3.github.io/minimap2/minimap2.html|man]]) but honestly, most of it beyond me. Thankfully Leng Hi has created some presets, which are essentially a preconfigured set of parameters fit for a particular commonly used task. For example, the ''map-ont'' preset (for mapping DNAseq Oxford Nanopore data) uses the default values for all parameters, whereas the ''map-hifi'' (for mapping DNAseq PacBio HiFi data) uses ''-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200''.

For mapping long read RNAseq data, we can use the preset ''splice'', which is short for the following parameter settings: ''-k15 -w5 --splice -g2k -G200k -A1 -B2 -O2,32 -E1,0 -b0 -C9 -z200 -ub --junc-bonus=9 --cap-sw-mem=0 --splice-flank=yes''. The splice preset also ensures that introns in read mappings are properly denoted with the 'N' symbol rather than the 'D' (for deletion) symbol. For example, 200N in the CIGAR string would indicate an intron of 200 bp in the aligned read. This also makes it nicer to view RNAseq alignments in IGV or Tablet.

I think these parameters are particularly tuned for Human / Mouse genomes because ''-G200k'', as far as I understand, indicates that you can expect introns up to 200k in length, and the ''--splice-flank=yes'' option assumes that the base immediately following a **GT donor site** is an A or G, and the base immediately preceding the **AG acceptor site** is a C or T. So you may want to overwrite these settings by invoking these settings after setting the preset:

<code>
minimap2 -ax splice $ASSEMBLY $READS -t $THREADS -G250 --splice-flank=no
</code>