mapping_rnaseq_data_to_your_genome
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| mapping_rnaseq_data_to_your_genome [2018/01/10 13:17] – 129.173.90.108 | mapping_rnaseq_data_to_your_genome [2024/10/25 15:20] (current) – [What the script does] 134.190.144.194 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== Mapping RNAseq data to your genome ====== | ||
| + | |||
| + | Initially written by I don't know who. Updated by Joran Martijn, December 2023. (Minor update, KW, Oct 2024) | ||
| + | |||
| + | |||
| For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome. | For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome. | ||
| Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that | Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that | ||
| - | can properly deal with spliced alignments. | + | can properly deal with spliced alignments. |
| - | Before you can map your RNAseq reads to your genome you need to build an index of the genome assembly. | + | ===== Mapping Illumina data ===== |
| - | In a shell script that you can qsub include | + | HISAT2 was developed with Illumina type reads in mind. It may thus not be the optimal mapping technology for older technologies such as Sanger and 454. |
| - | /// | + | ==== The perun shell script ==== |
| - | This will create a series (depending | + | The latest Perun shell submission script template can be found on the [[https:// |
| - | pce_p3x.1.ht2 | + | < |
| + | # | ||
| + | #$ -S /bin/bash | ||
| + | . / | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 20 | ||
| - | pce_p3x.2.ht2 | + | # input |
| + | base='/ | ||
| + | R1=${base}/ | ||
| + | R2=${base}/ | ||
| + | U1=${base}/ | ||
| + | U2=${base}/ | ||
| + | ASSEMBLY=' | ||
| - | pce_p3x.3.ht2 | + | # output |
| + | PREFIX=rnaseq_vs_masked_ergo_genome | ||
| + | INDEX=masked_ergo_genome | ||
| - | Once the indices have been built you can proceed | + | # load hisat2 software |
| - | There are many options | + | source activate hisat2 |
| - | restrict them to the bare essentials. | + | |
| + | # align rnaseq reads | ||
| + | hisat2-build -f $ASSEMBLY $INDEX | ||
| + | |||
| + | ## q: quiet, k: how many a single read can map | ||
| + | hisat2 \ | ||
| + | -q --threads 20 --phred33 \ | ||
| + | --max-intronlen 1000 \ | ||
| + | --rna-strandness RF -k 2 \ | ||
| + | -x $INDEX -1 $R1 -2 $R2 -U $U1,$U2 | \ | ||
| + | samtools sort --threads 20 -o $PREFIX.sort.bam - | ||
| + | |||
| + | samtools index -@ 20 $PREFIX.sort.bam $PREFIX.sort.bam.bai | ||
| + | |||
| + | conda deactivate | ||
| + | </ | ||
| + | |||
| + | To execute this script, copy this script to your working directory, and replace the input and output shell variables | ||
| + | |||
| + | ==== What the script does ==== | ||
| + | |||
| + | The first program that is called is '' | ||
| + | |||
| + | Note that '' | ||
| + | |||
| + | The indexing will create a series of files with the root name eg pce_p3x. The larger your assembly, the more files it creates. | ||
| + | |||
| + | * pce_p3x.1.ht2 | ||
| + | * pce_p3x.2.ht2 | ||
| + | * pce_p3x.3.ht2 | ||
| + | |||
| + | |||
| + | The '' | ||
| + | |||
| + | < | ||
| + | |||
| + | # indicate that the reads are in FASTQ format | ||
| + | # use -f if reads are in FASTA format | ||
| + | -q | ||
| + | |||
| + | # number of threads to use | ||
| + | # equivalent to --threads 6 | ||
| + | -p 6 | ||
| + | |||
| + | # indicate the scoring scheme for the | ||
| + | # per-base quality scores encoded in the FASTQ file | ||
| + | --phred33 | ||
| + | |||
| + | # indicate that you believe that the longest real intron is 1000 bps | ||
| + | # (default is 50,000) | ||
| + | --max-intronlen 1000 | ||
| + | |||
| + | # indicate that your library is stranded | ||
| + | # i.e. you know for each read from whether it came from | ||
| + | # the positive or negative strand of the genome | ||
| + | # (if its not stranded do not invoke this option) | ||
| + | --rna-strandness RF | ||
| + | |||
| + | # indicate how many distinct primary alignments | ||
| + | # to search for for each read | ||
| + | -k 2 | ||
| + | |||
| + | # indicate the root name | ||
| + | # of the index you generated | ||
| + | # with hisat2-build | ||
| + | -x pce_p3x | ||
| + | |||
| + | # indicate the file for the forward (/1) reads | ||
| + | -1 PCE_RNA_1_PairNtrim.fastq | ||
| + | |||
| + | # indicate the file for the reverse (/2) reads | ||
| + | -2 PCE_RNA_2_PairNtrim.fastq | ||
| + | |||
| + | # indicate one or more files containing unpaired reads | ||
| + | -U forward_unpaired.fastq, | ||
| + | |||
| + | # set the maximum (MX) and minimum (MN) mismatch penalties. Default: MX = 6, MN = 2. | ||
| + | # NOTE: you may have to increase this to get the optimal mapping of a read across an intron! | ||
| + | # I have noticed this is not always the case with the default penalty settings, | ||
| + | # so it is good to check -Kelsey Williamson | ||
| + | --mp MX,MN | ||
| + | |||
| + | # set the penalty for non-canonical splice sites (non-GT/AG) - default is 12 | ||
| + | # so if your genome uses non-canonical splice sites, you want to set this to 0 | ||
| + | --pen-noncansplice < | ||
| + | </code> | ||
| + | |||
| + | Note that in this script, the hisat2 | ||
| + | |||
| + | ==== Stranded information ==== | ||
| + | |||
| + | If you used HISAT2 with '' | ||
| + | |||
| + | < | ||
| + | samtools view -b --tag XS:+ < | ||
| + | samtools index < | ||
| + | |||
| + | samtools view -b --tag XS:- < | ||
| + | samtools index < | ||
| + | </ | ||
| + | |||
| + | If you did NOT run '' | ||
| + | |||
| + | ===== Mapping Sanger and 454 data ===== | ||
| + | |||
| + | It may be that you have to retrieve some older Sanger and 454 sequencing data and want to map it to a reference genome. I'm not sure whether HISAT2 is really unfit for longer reads (Sanger: ~1000 bp, 454: ~500 bp), but I've had some sensible results using [[https:// | ||
| + | |||
| + | ==== The perun shell script ==== | ||
| + | |||
| + | The latest Perun shell submission script template can be found on the [[https:// | ||
| - | A typical shell script to qsub would look like this | ||
| < | < | ||
| - | #!/bin/sh | + | #!/bin/bash |
| - | #$ -o logo_p3x_hisat | + | #$ -S /bin/bash |
| + | . / | ||
| #$ -cwd | #$ -cwd | ||
| - | #$ -pe threaded | + | #$ -pe threaded |
| + | #$ -q 256G-batch | ||
| - | / | + | # input |
| + | READS=' | ||
| + | ASSEMBLY=' | ||
| + | |||
| + | # parameters | ||
| + | THREADS=10 | ||
| + | |||
| + | # output | ||
| + | BAM=' | ||
| + | |||
| + | source activate minimap2 | ||
| + | |||
| + | # options | ||
| + | # -a Outputs in SAM format and generates CIGAR strings | ||
| + | # -x Presets (i.e. nanopore, pacbio, short reads etc). Presets are preconfigured sets of options | ||
| + | # -t Number of Threads | ||
| + | |||
| + | # define these options after setting the preset, so they overwrite the overlapping option in the preset | ||
| + | # --secondary=no | ||
| + | # -G< | ||
| + | # --splice-flank=yes | ||
| + | # And that the base immediately before a AG acceptor site is a C or T. | ||
| + | # May be better to set to =no for exotic protists? | ||
| + | |||
| + | # if you want to generate spliced alignments | ||
| + | # because you have long read transcriptome data | ||
| + | minimap2 \ | ||
| + | | ||
| + | -t $THREADS -G250 \ | ||
| + | | samtools view -bS - | samtools sort -o $BAM - | ||
| + | |||
| + | samtools index $BAM | ||
| </ | </ | ||
| + | |||
| + | To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with '' | ||
| + | |||
| + | Minimap2 has a rich set of parameters that you can finely tune (see also its [[https:// | ||
| + | |||
| + | For mapping long read RNAseq data, we can use the preset '' | ||
| + | |||
| + | I think these parameters are particularly tuned for Human / Mouse genomes because '' | ||
| + | |||
| + | < | ||
| + | minimap2 -ax splice $ASSEMBLY $READS -t $THREADS -G250 --splice-flank=no | ||
| + | </ | ||
mapping_rnaseq_data_to_your_genome.1515604655.txt.gz · Last modified: by 129.173.90.108
