mapping_rnaseq_data_to_your_genome
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| mapping_rnaseq_data_to_your_genome [2024/02/20 11:29] – [Mapping RNAseq data to your genome] 134.190.232.164 | mapping_rnaseq_data_to_your_genome [2024/10/25 15:20] (current) – [What the script does] 134.190.144.194 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Mapping RNAseq data to your genome ====== | ====== Mapping RNAseq data to your genome ====== | ||
| - | Initially written by I don't know who. Updated by Joran Martijn, December 2023 | + | Initially written by I don't know who. Updated by Joran Martijn, December 2023. (Minor update, KW, Oct 2024) |
| Line 8: | Line 8: | ||
| Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that | Unless you have a prokaryotic genome or a eukaryotic genome with intronless genes you need a mapper that | ||
| can properly deal with spliced alignments. [[https:// | can properly deal with spliced alignments. [[https:// | ||
| + | |||
| + | ===== Mapping Illumina data ===== | ||
| HISAT2 was developed with Illumina type reads in mind. It may thus not be the optimal mapping technology for older technologies such as Sanger and 454. | HISAT2 was developed with Illumina type reads in mind. It may thus not be the optimal mapping technology for older technologies such as Sanger and 454. | ||
| - | ===== The perun shell script | + | ==== The perun shell script ==== |
| The latest Perun shell submission script template can be found on the [[https:// | The latest Perun shell submission script template can be found on the [[https:// | ||
| Line 55: | Line 57: | ||
| To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with '' | To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with '' | ||
| - | ===== What the script does ===== | + | ==== What the script does ==== |
| The first program that is called is '' | The first program that is called is '' | ||
| Line 111: | Line 113: | ||
| # indicate one or more files containing unpaired reads | # indicate one or more files containing unpaired reads | ||
| -U forward_unpaired.fastq, | -U forward_unpaired.fastq, | ||
| + | |||
| + | # set the maximum (MX) and minimum (MN) mismatch penalties. Default: MX = 6, MN = 2. | ||
| + | # NOTE: you may have to increase this to get the optimal mapping of a read across an intron! | ||
| + | # I have noticed this is not always the case with the default penalty settings, | ||
| + | # so it is good to check -Kelsey Williamson | ||
| + | --mp MX,MN | ||
| + | |||
| + | # set the penalty for non-canonical splice sites (non-GT/AG) - default is 12 | ||
| + | # so if your genome uses non-canonical splice sites, you want to set this to 0 | ||
| + | --pen-noncansplice <int> | ||
| </ | </ | ||
| Note that in this script, the hisat2 output is directly fed into '' | Note that in this script, the hisat2 output is directly fed into '' | ||
| - | ===== Stranded information | + | ==== Stranded information ==== |
| If you used HISAT2 with '' | If you used HISAT2 with '' | ||
| Line 128: | Line 140: | ||
| If you did NOT run '' | If you did NOT run '' | ||
| + | |||
| + | ===== Mapping Sanger and 454 data ===== | ||
| + | |||
| + | It may be that you have to retrieve some older Sanger and 454 sequencing data and want to map it to a reference genome. I'm not sure whether HISAT2 is really unfit for longer reads (Sanger: ~1000 bp, 454: ~500 bp), but I've had some sensible results using [[https:// | ||
| + | |||
| + | ==== The perun shell script ==== | ||
| + | |||
| + | The latest Perun shell submission script template can be found on the [[https:// | ||
| + | |||
| + | < | ||
| + | #!/bin/bash | ||
| + | #$ -S /bin/bash | ||
| + | . / | ||
| + | #$ -cwd | ||
| + | #$ -pe threaded 10 | ||
| + | #$ -q 256G-batch | ||
| + | |||
| + | # input | ||
| + | READS=' | ||
| + | ASSEMBLY=' | ||
| + | |||
| + | # parameters | ||
| + | THREADS=10 | ||
| + | |||
| + | # output | ||
| + | BAM=' | ||
| + | |||
| + | source activate minimap2 | ||
| + | |||
| + | # options | ||
| + | # -a Outputs in SAM format and generates CIGAR strings | ||
| + | # -x Presets (i.e. nanopore, pacbio, short reads etc). Presets are preconfigured sets of options | ||
| + | # -t Number of Threads | ||
| + | |||
| + | # define these options after setting the preset, so they overwrite the overlapping option in the preset | ||
| + | # --secondary=no | ||
| + | # -G< | ||
| + | # --splice-flank=yes | ||
| + | # And that the base immediately before a AG acceptor site is a C or T. | ||
| + | # May be better to set to =no for exotic protists? | ||
| + | |||
| + | # if you want to generate spliced alignments | ||
| + | # because you have long read transcriptome data | ||
| + | minimap2 \ | ||
| + | -ax splice $ASSEMBLY $READS \ | ||
| + | -t $THREADS -G250 \ | ||
| + | | samtools view -bS - | samtools sort -o $BAM - | ||
| + | |||
| + | samtools index $BAM | ||
| + | </ | ||
| + | |||
| + | To execute this script, copy this script to your working directory, and replace the input and output shell variables with your own reads and assembly files. Then simply invoke the script with '' | ||
| + | |||
| + | Minimap2 has a rich set of parameters that you can finely tune (see also its [[https:// | ||
| + | |||
| + | For mapping long read RNAseq data, we can use the preset '' | ||
| + | |||
| + | I think these parameters are particularly tuned for Human / Mouse genomes because '' | ||
| + | |||
| + | < | ||
| + | minimap2 -ax splice $ASSEMBLY $READS -t $THREADS -G250 --splice-flank=no | ||
| + | </ | ||
mapping_rnaseq_data_to_your_genome.1708442992.txt.gz · Last modified: by 134.190.232.164
