mapping_rnaseq_data_to_your_genome
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| mapping_rnaseq_data_to_your_genome [2018/01/10 13:24] – 129.173.90.108 | mapping_rnaseq_data_to_your_genome [2024/10/25 15:20] (current) – [What the script does] 134.190.144.194 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== Mapping RNAseq data to your genome ====== | ||
| + | |||
| + | Initially written by I don't know who. Updated by Joran Martijn, December 2023. (Minor update, KW, Oct 2024) | ||
| + | |||
| + | |||
| For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome. | For various reasons, at some point, you will want to map your RNAseq data onto your assembled genome. | ||
| Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that | Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that | ||
| - | can properly deal with spliced alignments. | + | can properly deal with spliced alignments. |
| - | Before you can map your RNAseq reads to your genome you need to build an index of the genome assembly. | + | ===== Mapping Illumina data ===== |
| - | In a shell script that you can qsub include | + | HISAT2 was developed with Illumina type reads in mind. It may thus not be the optimal mapping technology for older technologies such as Sanger and 454. |
| - | /// | + | ==== The perun shell script ==== |
| - | This will create a series (depending | + | The latest Perun shell submission script template can be found on the [[https:// |
| - | pce_p3x.1.ht2 | + | < |
| + | # | ||
| + | #$ -S /bin/bash | ||
| + | . / | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 20 | ||
| - | pce_p3x.2.ht2 | + | # input |
| + | base='/ | ||
| + | R1=${base}/ | ||
| + | R2=${base}/ | ||
| + | U1=${base}/ | ||
| + | U2=${base}/ | ||
| + | ASSEMBLY=' | ||
| - | pce_p3x.3.ht2 | + | # output |
| + | PREFIX=rnaseq_vs_masked_ergo_genome | ||
| + | INDEX=masked_ergo_genome | ||
| - | Once the indices have been built you can proceed | + | # load hisat2 software |
| - | There are many options available http://ccb.jhu.edu/ | + | source activate hisat2 |
| - | restrict them to the bare essentials. | + | |
| + | # align rnaseq reads | ||
| + | hisat2-build -f $ASSEMBLY $INDEX | ||
| + | |||
| + | ## q: quiet, k: how many a single read can map | ||
| + | hisat2 \ | ||
| + | -q --threads 20 --phred33 \ | ||
| + | --max-intronlen 1000 \ | ||
| + | --rna-strandness RF -k 2 \ | ||
| + | -x $INDEX -1 $R1 -2 $R2 -U $U1,$U2 | \ | ||
| + | samtools sort --threads 20 -o $PREFIX.sort.bam - | ||
| + | |||
| + | samtools index -@ 20 $PREFIX.sort.bam $PREFIX.sort.bam.bai | ||
| + | |||
| + | conda deactivate | ||
| + | </ | ||
| + | |||
| + | To execute this script, copy this script to your working directory, and replace the input and output shell variables | ||
| + | |||
| + | ==== What the script does ==== | ||
| + | |||
| + | The first program that is called is '' | ||
| + | |||
| + | Note that '' | ||
| + | |||
| + | The indexing will create a series of files with the root name eg pce_p3x. The larger your assembly, the more files it creates. | ||
| + | |||
| + | * pce_p3x.1.ht2 | ||
| + | * pce_p3x.2.ht2 | ||
| + | * pce_p3x.3.ht2 | ||
| + | |||
| + | |||
| + | The '' | ||
| - | A typical shell script to qsub would look like this | ||
| < | < | ||
| - | #!/bin/sh | ||
| - | #$ -o logo_p3x_hisat | ||
| - | #$ -cwd | ||
| - | #$ -pe threaded 6 | ||
| - | / | + | # indicate that the reads are in FASTQ format |
| + | # use -f if reads are in FASTA format | ||
| + | -q | ||
| + | |||
| + | # number of threads to use | ||
| + | # equivalent to --threads 6 | ||
| + | -p 6 | ||
| + | |||
| + | # indicate the scoring scheme for the | ||
| + | # per-base quality scores encoded in the FASTQ file | ||
| + | --phred33 | ||
| + | |||
| + | # indicate that you believe that the longest real intron is 1000 bps | ||
| + | # (default is 50,000) | ||
| + | --max-intronlen 1000 | ||
| + | |||
| + | # indicate that your library is stranded | ||
| + | # i.e. you know for each read from whether it came from | ||
| + | # the positive or negative strand of the genome | ||
| + | # (if its not stranded do not invoke this option) | ||
| + | --rna-strandness RF | ||
| + | |||
| + | # indicate how many distinct primary alignments | ||
| + | # to search for for each read | ||
| + | -k 2 | ||
| + | |||
| + | # indicate the root name | ||
| + | # of the index you generated | ||
| + | # with hisat2-build | ||
| + | -x pce_p3x | ||
| + | |||
| + | # indicate the file for the forward (/1) reads | ||
| + | -1 PCE_RNA_1_PairNtrim.fastq | ||
| + | |||
| + | # indicate the file for the reverse (/2) reads | ||
| + | -2 PCE_RNA_2_PairNtrim.fastq | ||
| + | |||
| + | # indicate one or more files containing unpaired reads | ||
| + | -U forward_unpaired.fastq, | ||
| + | |||
| + | # set the maximum (MX) and minimum (MN) mismatch penalties. Default: MX = 6, MN = 2. | ||
| + | # NOTE: you may have to increase this to get the optimal mapping of a read across an intron! | ||
| + | # I have noticed this is not always the case with the default penalty settings, | ||
| + | # so it is good to check -Kelsey Williamson | ||
| + | --mp MX,MN | ||
| + | |||
| + | # set the penalty for non-canonical splice sites (non-GT/AG) - default is 12 | ||
| + | # so if your genome uses non-canonical splice sites, you want to set this to 0 | ||
| + | --pen-noncansplice <int> | ||
| </ | </ | ||
| - | where | + | Note that in this script, the hisat2 output is directly fed into '' |
| - | -q indicates that the reads are in fastq format versus -f for fasta | + | ==== Stranded information ==== |
| - | -p 6 indicates how many threads | + | If you used HISAT2 with '' |
| - | --phred33 indicates that scoring scheme for the quality scores | + | < |
| + | samtools view -b --tag XS:+ < | ||
| + | samtools index < | ||
| - | --max-intronlen 1000 indicates that you believe that the longest real intron is 1000 bps (default is 50,000) | + | samtools view -b --tag XS:- < |
| + | samtools index < | ||
| + | </ | ||
| - | -k 2 indicates | + | If you did NOT run '' |
| - | -x pce_p3x is the index root name | + | ===== Mapping Sanger and 454 data ===== |
| - | -1 PCE_RNA_1_PairNtrim.fastq indicates the file for the forward | + | It may be that you have to retrieve some older Sanger and 454 sequencing data and want to map it to a reference genome. I'm not sure whether HISAT2 is really unfit for longer reads (Sanger: ~1000 bp, 454: ~500 bp), but I've had some sensible results using [[https:// |
| - | -2 PCE_RNA_2_PairNtrim.fastq indicates the file for the reverse (/2) reads | + | ==== The perun shell script ==== |
| - | -S pce_3x.sam indicates | + | The latest Perun shell submission script template can be found on the [[https:// |
| + | |||
| + | < | ||
| + | # | ||
| + | #$ -S /bin/bash | ||
| + | . / | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 10 | ||
| + | #$ -q 256G-batch | ||
| + | |||
| + | # input | ||
| + | READS=' | ||
| + | ASSEMBLY=' | ||
| + | |||
| + | # parameters | ||
| + | THREADS=10 | ||
| + | |||
| + | # output | ||
| + | BAM=' | ||
| + | |||
| + | source activate minimap2 | ||
| + | |||
| + | # options | ||
| + | # -a Outputs in SAM format and generates CIGAR strings | ||
| + | # -x Presets (i.e. nanopore, pacbio, short reads etc). Presets are preconfigured sets of options | ||
| + | # -t Number of Threads | ||
| + | |||
| + | # define these options after setting | ||
| + | # --secondary=no | ||
| + | # -G< | ||
| + | # --splice-flank=yes | ||
| + | # And that the base immediately before a AG acceptor site is a C or T. | ||
| + | # May be better to set to =no for exotic protists? | ||
| + | |||
| + | # if you want to generate spliced alignments | ||
| + | # because you have long read transcriptome data | ||
| + | minimap2 \ | ||
| + | -ax splice $ASSEMBLY $READS \ | ||
| + | -t $THREADS -G250 \ | ||
| + | | samtools view -bS - | samtools sort -o $BAM - | ||
| + | |||
| + | samtools index $BAM | ||
| + | </ | ||
| + | |||
| + | To execute this script, copy this script to your working directory, and replace | ||
| + | |||
| + | Minimap2 has a rich set of parameters that you can finely tune (see also its [[https:// | ||
| + | |||
| + | For mapping long read RNAseq data, we can use the preset '' | ||
| + | |||
| + | I think these parameters are particularly tuned for Human / Mouse genomes because '' | ||
| + | |||
| + | < | ||
| + | minimap2 -ax splice $ASSEMBLY $READS -t $THREADS -G250 --splice-flank=no | ||
| + | </ | ||
mapping_rnaseq_data_to_your_genome.1515605068.txt.gz · Last modified: by 129.173.90.108
