This is an old revision of the document!

Mapping RNAseq data to your genome

Initially written by I don't know who. Updated by Joran Martijn, December 2023

For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome.

Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that can properly deal with spliced alignments. Tophat2 is widely used and cited. However, it has now been superseded by HISAT2. Its manual can be found here.

The perun shell script

The latest Perun shell submission script template can be found on the RogerLab GitHub page. Here is a copy paste of what it looks like as of December 2023:

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
#$ -cwd
#$ -pe threaded 20

# input
base='/scratch3/rogerlab_databases/Fastq_files/Ergobibamus/Illumina/RNAseq/trimmomatic-0.39'
R1=${base}/eb_rna_fw.prd.fastq.gz
R2=${base}/eb_rna_rv.prd.fastq.gz
U1=${base}/eb_rna_fw.unp.fastq.gz
U2=${base}/eb_rna_rv.unp.fastq.gz
ASSEMBLY='ergo_cyp_genome.fasta.masked'

# output
PREFIX=rnaseq_vs_masked_ergo_genome
INDEX=masked_ergo_genome

# load hisat2 software
source activate hisat2

# align rnaseq reads
hisat2-build -f $ASSEMBLY $INDEX

## q: quiet, k: how many a single read can map
hisat2 \
    -q --threads 20 --phred33 \
    --max-intronlen 1000 \
    --rna-strandness RF -k 2 \
    -x $INDEX -1 $R1 -2 $R2 -U $U1,$U2 | \
	samtools sort --threads 20 -o $PREFIX.sort.bam -

samtools index -@ 20 $PREFIX.sort.bam $PREFIX.sort.bam.bai

conda deactivate

To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with qsub run_hisat2_rnaseq.sh.

What the script does

The first program that is called is hisat2-build, and it builds an index of the genome assembly. The mapping algorithm invoked by hisat2 in the next line relies on this index. Other mapping algorithms such as bwa and bowtie use similar strategies.

Note that hisat2-build is called here with the -f option. I actually have no idea what that means. There is no documentation on this option, but it was originally written on this LabWiki to use that option, and it doesn't seem to break anything and I'm getting desireable results, so I'm just leaving it in.

The indexing will create a series of files with the root name eg pce_p3x. The larger your assembly, the more files it creates.

pce_p3x.1.ht2
pce_p3x.2.ht2
pce_p3x.3.ht2

The hisat2 options:

# indicate that the reads are in FASTQ format
# use -f if reads are in FASTA format
-q 

# number of threads to use
# equivalent to --threads 6
-p 6 

# indicate the scoring scheme for the 
# per-base quality scores encoded in the FASTQ file
--phred33 

# indicate that you believe that the longest real intron is 1000 bps 
# (default is 50,000)
--max-intronlen 1000

# indicate how many distinct primary alignments 
# to search for for each read
-k 2 

# indicate the root name
# of the index you generated
# with hisat2-build
-x pce_p3x 

# indicate the file for the forward (/1) reads
-1 PCE_RNA_1_PairNtrim.fastq 

# indicate the file for the reverse (/2) reads
-2 PCE_RNA_2_PairNtrim.fastq 

# indicate the name for the SAM output file
-S pce_3x.sam

cgeb2001's DokuWiki!

Table of Contents

Mapping RNAseq data to your genome

The perun shell script

What the script does