This is an old revision of the document!

Mapping RNAseq data to your genome

Initially written by I don't know who. Updated by Joran Martijn, December 2023

For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome.

Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that can properly deal with spliced alignments. Tophat2 is widely used and cited. However, it has now been superseded by HISAT2. Its manual can be found here.

The perun shell script

The latest Perun shell submission script template can be found on the RogerLab GitHub page. Here is a copy paste of what it looks like as of December 2023:

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20

# input
base='/scratch3/rogerlab_databases/Fastq_files/Ergobibamus/Illumina/RNAseq/trimmomatic-0.39'
R1=${base}/eb_rna_fw.prd.fastq.gz
R2=${base}/eb_rna_rv.prd.fastq.gz
U1=${base}/eb_rna_fw.unp.fastq.gz
U2=${base}/eb_rna_rv.unp.fastq.gz
ASSEMBLY='ergo_cyp_genome.fasta.masked'

# output
PREFIX=rnaseq_vs_masked_ergo_genome
INDEX=masked_ergo_genome

# load hisat2 software
source activate hisat2

# align rnaseq reads
hisat2-build -f $ASSEMBLY $INDEX

## q: quiet, k: how many a single read can map
hisat2 \
    -q --threads 20 --phred33 \
    --max-intronlen 1000 \
    --rna-strandness RF -k 2 \
    -x $INDEX -1 $R1 -2 $R2 -U $U1,$U2 | \
	samtools sort --threads 20 -o $PREFIX.sort.bam -

samtools index -@ 20 $PREFIX.sort.bam $PREFIX.sort.bam.bai

conda deactivate

To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with qsub run_hisat2_rnaseq.sh.

What the script does

The first program that is called is hisat2-build, and it builds an index of the genome assembly. The mapping algorithm invoked by hisat2 in the next line relies on this index. Other mapping algorithms such as bwa and bowtie use similar strategies.

Note that hisat2-build is called here with the -f option. I actually have no idea what that means. There is no documentation on this option, but it was originally written on this LabWiki to use that option, and it doesn't seem to break anything and I'm getting desireable results, so I'm just leaving it in.

The indexing will create a series of files with the root name eg pce_p3x. The larger your assembly, the more files it creates.

pce_p3x.1.ht2
pce_p3x.2.ht2
pce_p3x.3.ht2

The hisat2 options:

# indicate that the reads are in FASTQ format
# use -f if reads are in FASTA format
-q 

# number of threads to use
# equivalent to --threads 6
-p 6 

# indicate the scoring scheme for the 
# per-base quality scores encoded in the FASTQ file
--phred33 

# indicate that you believe that the longest real intron is 1000 bps 
# (default is 50,000)
--max-intronlen 1000

# indicate that your library is stranded
# i.e. you know for each read from whether it came from 
# the positive or negative strand of the genome
# (if its not stranded do not invoke this option)
--rna-strandness RF

# indicate how many distinct primary alignments 
# to search for for each read
-k 2 

# indicate the root name
# of the index you generated
# with hisat2-build
-x pce_p3x 

# indicate the file for the forward (/1) reads
-1 PCE_RNA_1_PairNtrim.fastq 

# indicate the file for the reverse (/2) reads
-2 PCE_RNA_2_PairNtrim.fastq

# indicate one or more files containing unpaired reads
-U forward_unpaired.fastq,reverse_unpaired.fastq

Note that in this script, the hisat2 output is directly fed into samtools sort which then generates a sorted BAM file immediately. The formation of a SAM file is thus skipped. This is beneficial because it reduces computation time, and the number of samtools commands to run. A SAM file is essentially unnecessary once you have the BAM file anyway. However, if you still wish to generate the SAM file, you can remove the pipe-into-samtools line and replace it with another option for hisat2: -S <output.sam>

Stranded information

If you used HISAT2 with –rna-strandness RF to map your RNAseq reads to your genome, all “positive” reads will have the tag XS:A:+, while all “negative” reads will have the XS:A:- tag. One can separate the two types of reads from the HISAT2-generated BAM file using the following command:

samtools view -b --tag XS:+ <rnaseq.bam> > <rnaseq.pos.bam>
samtools index <rnaseq.pos.bam> <rnaseq.pos.bam.bai>

samtools view -b --tag XS:- <rnaseq.bam> > <rnaseq.neg.bam>
samtools index <rnaseq.neg.bam> <rnaseq.neg.bam.bai>

If you did NOT run –rna-strandness RF but still attempt to do the separation based on the XS tag, something odd will happen: only read mappings that have spliced alignment (i.e. those reads that overlap with introns) will be extracted from the original BAM file. This can lead to volcano-like expression patterns when visualizing the RNAseq coverage in IGV.

cgeb2001's DokuWiki!

Table of Contents

Mapping RNAseq data to your genome

The perun shell script

What the script does

Stranded information