Differential gene expression analysis

by Tommy Harding

Notes on this tutorial:

Terms in between <> are user-specified names.
Steps indicated with † are computer-intensive and should be submitted to perun through a shell script.
This tutorial summarizes the basics. To benefit the full potential of each program, refer to the user manuals.

A- Quantify gene expression using RSEM (https://deweylab.github.io/RSEM/README.html). RSEM maps reads onto transcript sequences considering multi-mapping reads (i.e. reads that map at multiple locations in the transcriptome) in order to determine spliced isoform-specific expression. The gene count matrix generated by RSEM can then be used to assess differential expression using other programs like EBSeq, DESeq or VOOM-limma.

1- Prepare reference file:

rsem-prepare-reference --no-polyA --bowtie --bowtie-path /usr/bin/ <transcriptome.fasta> <assembly_short_name>

When measuring spliced isoform-specific expression, some effort should be invested to reduce the redundancy of sequences in the assembly (especially if working with sequences assembled by Trinity which generates many many putative isoforms). RSEM assigns reads to isoforms based on the pool of isoforms onto which the reads map. In other words, the sequences in your transcriptome affect how the reads are assigned to each sequence. One strategy can be to discard the smallest protein sequence of highly similar sequence pair if the alignment covers >90% of its length and if <5 mismatches is observed. You therefore end up with the longest sequences which are probably the most legitimate… However, if the assembly contains sequences with introns (which happens if genomic DNA contaminated the RNA extracts used to build libraries, or if the organism of interest is sloppy at splicing), these sequences will be favored. As a result, it is always good practice to investigate read coverage when interpreting results (see step B-11).

2- Map reads on transcripts for each replicate of each condition independently †:

rsem-calculate-expression <comma-separated_list_of_fastq_files_for_considered_replicate> <assembly_short_name> <condition_X_replicate_X_ID> --calc-ci --num-threads 2 --bowtie-path /usr/bin --estimate-rspd --fragment-length-mean <X> --fragment-length-sd <Y>

It is not recommended to use both paired-read and single-read sets (post-quality-trimming) as it could lead to bias in transcript abundance. Both reads (forward and reverse) of the same pair essentially carry the same gene expression information (they come from the same fragment). To take advantage of single reads, consider only the forward reads of both paired and unpaired (after quality filtering) reads. Note that some analysts completely discard unpaired reads to be stringent (they consider that if one of the read member of a pair did not pass the quality filter, it is not a good sign). It is for you to judge…
Especially if you are not using both reads of pairs, it is important to inform RSEM about the size of your library. Fragment size values (X and Y in the command line above) can be calculated empirically using Qualimap (see section A’ below).

3- Generate gene expression data matrix (for isoform-level expression, use * isoforms.results files (as shown in the example command line below); for gene-level expression, use *genes.results files):

rsem-generate-data-matrix <condition_1_replicate_1_ID>.isoforms.results <condition_1_replicate_2_ID>.isoforms.results  <condition_1_replicate_3_ID>.isoforms.results <condition_2_replicate_1_ID>.isoforms.results <condition_2_replicate_1_ID>.isoforms.results <condition_2_replicate_2 ID>.isoforms.results <condition_2_replicate_3_ID>.isoforms.results  >  <matrix_name>

4- Generate ng vectors (for isoform-level differential expression assessment only):

rsem-generate-ngvector <transcriptome.fasta> <assembly_short_name>

A’- Calculate fragment size using Qualimap. In order to determine the size of your library empirically, map paired reads using bowtie and then generate statistics on your data using Qualimap. Perform this analysis in another folder than the one used for the RSEM run.

1- Index transcript sequences:

/opt/perun/bowtie2-2.2.4/bowtie2-build -f <transcriptome.fasta>  <assembly_short_name>

2- Map paired reads onto transcript sequences †:

/opt/perun/bowtie2-2.2.4/bowtie2 -a -q --phred33 -p 4 -x <assembly_short_name> -1 <fastq_file_for_forward_reads> -2 <fastq_file_for_reverse_reads> -S <output_name_in_sam_format>

3- Convert sam file into bam file †:

samtools view -bS <output_name_in_sam_format>  >  <output_name_in_bam_format>

4- Sort bam file †:

samtools sort <output_name_in_bam_format> <output_name_in_bam_format_for_sorted_file>

5- Use Qualimap to get statistics †.:

/opt/perun/qualimap_v2.1.1/qualimap bamqc -bam <output_name_in_bam_format_for_sorted_file> -nt 4 -outdir <path_to_directory_in_which_to_write_the_output> --java-mem-size=4G

Important to set java-mem-size value to higher than the default otherwise you run out of memory. Also, use the higher RAM nodes (e.g. 'qsub -q 256G-batch')

B- Assess differential expression using EBSeq (http://www.bioconductor.org/packages/devel/bioc/vignettes/EBSeq/inst/doc/EBSeq_Vignette.pdf). In order to access to the full potential of EBSeq, run the program in R. (To start an R session on perun, type R.)

1- Load the EBSeq library:

> library(EBSeq)

2- Define the data matrix:

> IsoMat <- data.matrix(read.table(file="<matrix_name>"))

3- Normalize read counts (here, median normalization):

> Sizes=MedianNorm(IsoMat)

4- Define ng vectors:

> NgVec <- scan(file="<assembly_short_name>.ngvec", what=0, sep="\n")

5- Define experimental conditions:

for 2 conditions x 3 replicates:

> Conditions=as.factor(rep(c("<condition1>","<condition2>"), each=3))

for >2 conditions (here, 3 conditions x 3 replicates):

> Conditions=as.factor(rep(c("<condition1>","<condition2>", "<condition3>"), each=3))
> Patterns=GetPatterns(Conditions)

6- Assess differential expression:

for 2 conditions:

> IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, Conditions=Conditions, sizeFactors=Sizes, maxround=10)

for >2 conditions:

> IsoEBres=EBMultiTest(Data=IsoMat, NgVector=NgVec, Conditions=Conditions, AllParti=Patterns, sizeFactors=Sizes, maxround=10)

7- Make sure parameters (Alpha, Beta and P) have converged:

> IsoEBres $Alpha
values for Alpha for each iteration will be listed here...
> IsoEBres $Beta
values for Beta for each iteration will be listed here...
> IsoEBres $P
values for P for each iteration will be listed here...

The variation between the second last and the last iterations must be <0.001. If parameters haven’t converged, perform step 6 again but increasing the value for maxround.

8- Write results to outfile:

for 2 conditions:

> IsoPP=GetPPMat(IsoEBres)
> write.table(IsoPP,file="<prefix_for_outfile_for_probabilities>_PPDE.txt")
> IsoFC=PostFC(IsoEBres)
> write.table(IsoFC,file="<prefix_for_outfile_for_fold_change_values>_FC.txt")

for >2 conditions:

> IsoMultiPP=GetMultiPP(IsoEBres)
> write.table(IsoMultiPP[1],file="<prefix_for_outfile_for_probabilities>_PPDE.txt")
> IsoMultiFC=GetMultiFC(IsoEBres)
> write.table(IsoMultiFC[3],file="<prefix_for_outfile_for_fold_change_values>_FC.txt ")

9- Exit R:

> q()

10- Merge probabilities and fold changes into one file (called <project_name>_DEisoforms_EBSeqout.txt for isoform-level analysis):

for 2 conditions:

python /scratch/tharding/SCRIPTS/RSEM_catPPDEpostFC.py <prefix_for_outfile_for_probabilities>_PPDE.txt  <prefix_for_outfile_for_fold_change_values>_FC.txt

for >2 conditions:

python /scratch2/shess/SCRIPTS/RSEM_catPPDEpostFC_multiCond.py <prefix_for_outfile_for_probabilities>_PPDE.txt  prefix for outfile for fold change values>_FC.txt

11- Generate wiggle plots (number of reads mapping along ORFs) that are very useful when interpreting results:

create a file with sequence names:

python /scratch2/shess/SCRIPTS/Extract_seqnames.py <assembly_short_name>.transcripts.fa

For each replicate of each condition †:

rsem-plot-transcript-wiggles <condition_X_replicate_X_ID> <assembly_short_name>.transcriptsIDs.txt <condition_X_replicate_X_ID>_wiggle.pdf --show-unique

Using the –show-unique flag, read depth for unique reads will be shown in black, while read depth of reads which align to more than one places will be in red.

C- Assess differential expression using DESeq2. (http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)

1- Remove sequences with low read abundance (according to EBSeq, i.e. transcripts with 75 th quantile < = 10):

python /scratch/tharding/SCRIPTS/RemoveLowExpr.py <EBSeq_output_file_containing_PPDE_and_FC_values_created_at_step_B-10> <gene_count_matrix_generated_at_step_B-2>

2- Transform read counts to integers:

python /scratch/tharding/SCRIPTS/GetIntegerCounts.py <gene_count_matrix_without_lowly_expressed_genes_generated_at_step_C-1>

Run DESeq2 in R (type R to start an R session on perun).
3- Load DESeq2 library:

> library(DESeq2)

4- Define the count table:

> countsTable <- read.delim("<matrix_generated_at_step_C-2>",header=FALSE)
> countsTable <- countsTable[,-1]

5- Define conditions/replicates (here, 2 conditions x 3 replicates for example)

> colData <- data.frame(condition=factor(c("untreated","untreated","untreated","treated","treated","treated")))

6- Assess differential expression:

> dds <- DESeqDataSetFromMatrix(countsTable, colData, formula(~ condition))
> dds$condition <- factor(dds$condition, levels=c("untreated","treated"))
> dds <- DESeq(dds)
> res <- results(dds, alpha=0.05)

7- Create output file:

> write.csv(as.data.frame(res),file="<DESeq2_output_file_name>")

8- Exit R:

> q()

9- Recover original sequence names:

python /scratch/tharding/SCRIPTS/RecoverDESeqTable.py <gene_count_matrix_without_lowly_expressed_genes_generated_at_step_C-1> <DESeq2_output_file_name>

D- Assess differential expression using voom-limma. (https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf)

1- Remove sequences with low read abundance (according to EBSeq, i.e. transcripts with 75 th quantile < = 10) as in step C-1:

python /scratch/tharding/SCRIPTS/RemoveLowExpr.py <EBSeq_output_file_containing_PPDE_and_FC_values_created_at_step_B-10> <gene_count_matrix_generated_at_step_B-2>

Run voom-limma in R (to start an R session on perun, type R).
2- Load libraries:

> library(limma)
> library(edgeR)

3- Define the count table and prepare data (here, for 2 conditions x 3 replicates):

> counts <- read.delim("<matrix_generated_at_step_D-1>", header=FALSE)
> rownames(counts) <- counts$gene
> counts <- counts[,-1]
> dge <- DGEList(counts=counts, group=rep(1:2,each=3))

4- Normalize read counts (here, using the trimmed mean of M-values (TMM) normalization):

> dge <- calcNormFactors(dge)

5- Define the experimental design (here, 2 conditions x 3 replicates where OPT is the ‘basal’ condition and OPTvsMAX is the ‘tested’ condition):

> design <- cbind(OPT=1,OPTvsMAX=c(0,0,0,1,1,1))

6- Transform data using the voom method, a normal linear modeling strategy:

> v <- voom(dge,design,plot=TRUE)

7- Assess differential expression:

> fit <- eBayes(lmFit(v,design))

8- Write output file for uncorrected p-values:

> write.table(fit, file ="<limma_output_file_name_for_uncorrected_p-values>")

9- Write output file summarizing p-values corrected for multiple testing:

> results <- topTable(fit, coef="OPTvsMAX", n=Inf, p=1, adjust="BH")            
> write.table(results, file="<limma output file name for corrected p-values>")

n=Inf specifies to consider all genes
p=1 specifies the cutoff for adjusted p-value or what adjusted p-value to report in the output (here reporting all adjusted p-values; the table can be parsed afterwards according to your needs)
adjust=“BH” is the method for multiple testing (Benjamini-Hochberg in this case)

9- Exit R:

> q()

10- Recover original sequence names:

python /scratch/tharding/SCRIPTS/RecoverDESeqTable.py <gene_count_matrix_without_lowly_expressed_genes_generated_at_step_D-1> <limma_output_file_name>