by Tommy Harding
Notes on this tutorial:
A- Quantify gene expression using RSEM
(https://deweylab.github.io/RSEM/README.html). RSEM maps reads onto transcript sequences considering multi-mapping reads (i.e. reads that map at multiple locations in the transcriptome) in order to determine spliced isoform-specific expression. The gene count matrix generated by RSEM can then be used to assess differential expression using other programs like EBSeq, DESeq or VOOM-limma.
1- Prepare reference file:
rsem-prepare-reference --no-polyA --bowtie --bowtie-path /usr/bin/ <transcriptome.fasta> <assembly_short_name>
2- Map reads on transcripts for each replicate of each condition independently †:
rsem-calculate-expression <comma-separated_list_of_fastq_files_for_considered_replicate> <assembly_short_name> <condition_X_replicate_X_ID> --calc-ci --num-threads 2 --bowtie-path /usr/bin --estimate-rspd --fragment-length-mean <X> --fragment-length-sd <Y>
3- Generate gene expression data matrix (for isoform-level expression, use * isoforms.results files (as shown in the example command line below); for gene-level expression, use *genes.results files):
rsem-generate-data-matrix <condition_1_replicate_1_ID>.isoforms.results <condition_1_replicate_2_ID>.isoforms.results <condition_1_replicate_3_ID>.isoforms.results <condition_2_replicate_1_ID>.isoforms.results <condition_2_replicate_1_ID>.isoforms.results <condition_2_replicate_2 ID>.isoforms.results <condition_2_replicate_3_ID>.isoforms.results > <matrix_name>
4- Generate ng vectors (for isoform-level differential expression assessment only):
rsem-generate-ngvector <transcriptome.fasta> <assembly_short_name>
A’- Calculate fragment size using Qualimap. In order to determine the size of your library empirically, map paired reads using bowtie and then generate statistics on your data using Qualimap. Perform this analysis in another folder than the one used for the RSEM run.
1- Index transcript sequences:
/opt/perun/bowtie2-2.2.4/bowtie2-build -f <transcriptome.fasta> <assembly_short_name>
2- Map paired reads onto transcript sequences †:
/opt/perun/bowtie2-2.2.4/bowtie2 -a -q --phred33 -p 4 -x <assembly_short_name> -1 <fastq_file_for_forward_reads> -2 <fastq_file_for_reverse_reads> -S <output_name_in_sam_format>
3- Convert sam file into bam file †:
samtools view -bS <output_name_in_sam_format> > <output_name_in_bam_format>
4- Sort bam file †:
samtools sort <output_name_in_bam_format> <output_name_in_bam_format_for_sorted_file>
5- Use Qualimap to get statistics †.:
/opt/perun/qualimap_v2.1.1/qualimap bamqc -bam <output_name_in_bam_format_for_sorted_file> -nt 4 -outdir <path_to_directory_in_which_to_write_the_output> --java-mem-size=4G
B- Assess differential expression using EBSeq (http://www.bioconductor.org/packages/devel/bioc/vignettes/EBSeq/inst/doc/EBSeq_Vignette.pdf). In order to access to the full potential of EBSeq, run the program in R. (To start an R session on perun, type R.)
1- Load the EBSeq library:
> library(EBSeq)
2- Define the data matrix:
> IsoMat <- data.matrix(read.table(file="<matrix_name>"))
3- Normalize read counts (here, median normalization):
> Sizes=MedianNorm(IsoMat)
4- Define ng vectors:
> NgVec <- scan(file="<assembly_short_name>.ngvec", what=0, sep="\n")
5- Define experimental conditions:
> Conditions=as.factor(rep(c("<condition1>","<condition2>"), each=3))
> Conditions=as.factor(rep(c("<condition1>","<condition2>", "<condition3>"), each=3))
> Patterns=GetPatterns(Conditions)
6- Assess differential expression:
> IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, Conditions=Conditions, sizeFactors=Sizes, maxround=10)
> IsoEBres=EBMultiTest(Data=IsoMat, NgVector=NgVec, Conditions=Conditions, AllParti=Patterns, sizeFactors=Sizes, maxround=10)
7- Make sure parameters (Alpha, Beta and P) have converged:
> IsoEBres $Alpha values for Alpha for each iteration will be listed here... > IsoEBres $Beta values for Beta for each iteration will be listed here... > IsoEBres $P values for P for each iteration will be listed here...
8- Write results to outfile:
> IsoPP=GetPPMat(IsoEBres) > write.table(IsoPP,file="<prefix_for_outfile_for_probabilities>_PPDE.txt") > IsoFC=PostFC(IsoEBres) > write.table(IsoFC,file="<prefix_for_outfile_for_fold_change_values>_FC.txt")
> IsoMultiPP=GetMultiPP(IsoEBres) > write.table(IsoMultiPP[1],file="<prefix_for_outfile_for_probabilities>_PPDE.txt") > IsoMultiFC=GetMultiFC(IsoEBres) > write.table(IsoMultiFC[3],file="<prefix_for_outfile_for_fold_change_values>_FC.txt ")
9- Exit R:
> q()
10- Merge probabilities and fold changes into one file (called <project_name>_DEisoforms_EBSeqout.txt for isoform-level analysis):
python /scratch/tharding/SCRIPTS/RSEM_catPPDEpostFC.py <prefix_for_outfile_for_probabilities>_PPDE.txt <prefix_for_outfile_for_fold_change_values>_FC.txt
python /scratch2/shess/SCRIPTS/RSEM_catPPDEpostFC_multiCond.py <prefix_for_outfile_for_probabilities>_PPDE.txt prefix for outfile for fold change values>_FC.txt
11- Generate wiggle plots (number of reads mapping along ORFs) that are very useful when interpreting results:
python /scratch2/shess/SCRIPTS/Extract_seqnames.py <assembly_short_name>.transcripts.fa
rsem-plot-transcript-wiggles <condition_X_replicate_X_ID> <assembly_short_name>.transcriptsIDs.txt <condition_X_replicate_X_ID>_wiggle.pdf --show-unique
C- Assess differential expression using DESeq2. (http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
1- Remove sequences with low read abundance (according to EBSeq, i.e. transcripts with 75 th quantile < = 10):
python /scratch/tharding/SCRIPTS/RemoveLowExpr.py <EBSeq_output_file_containing_PPDE_and_FC_values_created_at_step_B-10> <gene_count_matrix_generated_at_step_B-2>
2- Transform read counts to integers:
python /scratch/tharding/SCRIPTS/GetIntegerCounts.py <gene_count_matrix_without_lowly_expressed_genes_generated_at_step_C-1>
Run DESeq2 in R (type R to start an R session on perun).
3- Load DESeq2 library:
> library(DESeq2)
4- Define the count table:
> countsTable <- read.delim("<matrix_generated_at_step_C-2>",header=FALSE)
> countsTable <- countsTable[,-1]
5- Define conditions/replicates (here, 2 conditions x 3 replicates for example)
> colData <- data.frame(condition=factor(c("untreated","untreated","untreated","treated","treated","treated")))
6- Assess differential expression:
> dds <- DESeqDataSetFromMatrix(countsTable, colData, formula(~ condition))
> dds$condition <- factor(dds$condition, levels=c("untreated","treated"))
> dds <- DESeq(dds)
> res <- results(dds, alpha=0.05)
7- Create output file:
> write.csv(as.data.frame(res),file="<DESeq2_output_file_name>")
8- Exit R:
> q()
9- Recover original sequence names:
python /scratch/tharding/SCRIPTS/RecoverDESeqTable.py <gene_count_matrix_without_lowly_expressed_genes_generated_at_step_C-1> <DESeq2_output_file_name>
D- Assess differential expression using voom-limma. (https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf)
1- Remove sequences with low read abundance (according to EBSeq, i.e. transcripts with 75 th quantile < = 10) as in step C-1:
python /scratch/tharding/SCRIPTS/RemoveLowExpr.py <EBSeq_output_file_containing_PPDE_and_FC_values_created_at_step_B-10> <gene_count_matrix_generated_at_step_B-2>
Run voom-limma in R (to start an R session on perun, type R).
2- Load libraries:
> library(limma) > library(edgeR)
3- Define the count table and prepare data (here, for 2 conditions x 3 replicates):
> counts <- read.delim("<matrix_generated_at_step_D-1>", header=FALSE)
> rownames(counts) <- counts$gene
> counts <- counts[,-1]
> dge <- DGEList(counts=counts, group=rep(1:2,each=3))
4- Normalize read counts (here, using the trimmed mean of M-values (TMM) normalization):
> dge <- calcNormFactors(dge)
5- Define the experimental design (here, 2 conditions x 3 replicates where OPT is the ‘basal’ condition and OPTvsMAX is the ‘tested’ condition):
> design <- cbind(OPT=1,OPTvsMAX=c(0,0,0,1,1,1))
6- Transform data using the voom method, a normal linear modeling strategy:
> v <- voom(dge,design,plot=TRUE)
7- Assess differential expression:
> fit <- eBayes(lmFit(v,design))
8- Write output file for uncorrected p-values:
> write.table(fit, file ="<limma_output_file_name_for_uncorrected_p-values>")
9- Write output file summarizing p-values corrected for multiple testing:
> results <- topTable(fit, coef="OPTvsMAX", n=Inf, p=1, adjust="BH") > write.table(results, file="<limma output file name for corrected p-values>")
9- Exit R:
> q()
10- Recover original sequence names:
python /scratch/tharding/SCRIPTS/RecoverDESeqTable.py <gene_count_matrix_without_lowly_expressed_genes_generated_at_step_D-1> <limma_output_file_name>