Statistical analysis of gene expression data
Data: RNASeq data
Techniques: Multiple comparison testing, PCA, clustering
Code: Analysis of gene expression data
This document goes through the workflow for differential expression of RNA-Seq (gene expression) data, starting with raw counts of sequencing reads. Much of the workflow follows the vignette in this DESeq manual.
The Pasilla dataset: Contains RNA-Seq count data for RNAi treated and untreated Drosophila melanogaster cells.
Install gplots
, RColorBrewer
and Bioconductor packages DESeq
and pasilla
if needed.
The Workflow:
First get the path to the file containing the raw count data. The file countains counts for each gene(row) in each sample (column).
datafile <- system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
count.table <- read.table(datafile, header=TRUE, row.names=1)
The first few rows of data:
head(count.table)
## untreated1 untreated2 untreated3 untreated4 treated1 treated2
## FBgn0000003 0 0 0 0 0 0
## FBgn0000008 92 161 76 70 140 88
## FBgn0000014 5 1 0 0 4 0
## FBgn0000015 0 2 1 2 1 0
## FBgn0000017 4664 8714 3564 3150 6205 3072
## FBgn0000018 583 761 245 310 722 299
## treated3
## FBgn0000003 1
## FBgn0000008 70
## FBgn0000014 0
## FBgn0000015 0
## FBgn0000017 3334
## FBgn0000018 308
Optional: delete any genes that were never detected (equal to zero in all conditions).
count.table <- count.table[rowSums(count.table) > 0,]
Metadata- Columns will be condition or libType. Rows correspond to the 7 (untreated or treated) samples.
pasillaDesign = data.frame(
row.names = colnames( count.table ),
condition = c( "untreated", "untreated", "untreated",
"untreated", "treated", "treated", "treated" ),
libType = c( "single-end", "single-end", "paired-end",
"paired-end", "single-end", "paired-end", "paired-end" ) )
This dataset contains RNA-Seq counts for RNAi treated vs untreated cells, as well as sequencing data using both single-end sequencing (reading fragments from only one end to the other) and paired-end sequencing (reading form both ends).
pairedSamples = pasillaDesign$libType == "paired-end"
countTable = count.table[ , pairedSamples ]
condition = pasillaDesign$condition[ pairedSamples ]
head(countTable)
## untreated3 untreated4 treated2 treated3
## FBgn0000003 0 0 0 1
## FBgn0000008 76 70 88 70
## FBgn0000014 0 0 0 0
## FBgn0000015 1 2 0 0
## FBgn0000017 3564 3150 3072 3334
## FBgn0000018 245 310 299 308
cds = newCountDataSet( countTable, condition )
Normalize the expression values of each treatment by dividing each column with its own size factor using the estimateSizeFactors function. This estimates the size factor by first taking each column and dividing by the geometric mean of the rows. The median of these ratios is used as the size factor for this column.
cds = estimateSizeFactors( cds )
sizeFactors( cds )
## untreated3 untreated4 treated2 treated3
## 0.8730966 1.0106112 1.0224517 1.1145888
This divides each column by the size factor, which makes them comparable.
head( counts( cds, normalized=TRUE ) )
## untreated3 untreated4 treated2 treated3
## FBgn0000003 0.000000 0.00000 0.00000 0.8971919
## FBgn0000008 87.046493 69.26502 86.06763 62.8034302
## FBgn0000014 0.000000 0.00000 0.00000 0.0000000
## FBgn0000015 1.145349 1.97900 0.00000 0.0000000
## FBgn0000017 4082.022370 3116.92579 3004.54278 2991.2376629
## FBgn0000018 280.610404 306.74508 292.43434 276.3350930
With DESeq, differential gene expression is approximated by a negative binomial distribution, for which we need the dispersion parameter. Estimating the dispersion parameter:
cds = estimateDispersions( cds )
The level of dispersion is related to the biological dispersion in each treatment.The more variation, the bigger the difference between counts between treatments is required before the differences become significant. Plot dispersion estimates:
plotDispEsts( cds )
See whether there is differential expression between untreated and treated. We need to correct for the fact that we are doing multiple comparison tests. In the case of gene expression data, it’s typical to control for the FDR by using methods like Benjamini-Hochberg, instead of the Bonferroni correction, which is too conservative and generally would lead to a lot of false negatives.
Output is a data.frame.
res = nbinomTest( cds, "untreated", "treated" )
head(res)
## id baseMean baseMeanA baseMeanB foldChange
## 1 FBgn0000003 0.2242980 0.000000 0.4485959 Inf
## 2 FBgn0000008 76.2956431 78.155755 74.4355310 0.9523999
## 3 FBgn0000014 0.0000000 0.000000 0.0000000 NaN
## 4 FBgn0000015 0.7810873 1.562175 0.0000000 0.0000000
## 5 FBgn0000017 3298.6821506 3599.474078 2997.8902236 0.8328690
## 6 FBgn0000018 289.0312286 293.677741 284.3847165 0.9683564
## log2FoldChange pval padj
## 1 Inf 1.0000000 1.0000000
## 2 -0.07036067 0.8354725 1.0000000
## 3 NaN NA NA
## 4 -Inf 0.4160556 1.0000000
## 5 -0.26383857 0.2414208 0.8811746
## 6 -0.04638999 0.7572819 1.0000000
padj is the adjusted p-value (controlled for FDR). To order by pad-j (decreasing):
res <- res[order(res$padj),]
head(res)
## id baseMean baseMeanA baseMeanB foldChange log2FoldChange
## 8817 FBgn0039155 463.4369 884.9640 41.90977 0.0473576 -4.400260
## 2132 FBgn0025111 1340.2282 311.1697 2369.28680 7.6141316 2.928680
## 570 FBgn0003360 2544.2512 4513.9457 574.55683 0.1272848 -2.973868
## 2889 FBgn0029167 2551.3113 4210.9571 891.66551 0.2117489 -2.239574
## 9234 FBgn0039827 188.5927 357.3299 19.85557 0.0555665 -4.169641
## 6265 FBgn0035085 447.2485 761.1898 133.30718 0.1751300 -2.513502
## pval padj
## 8817 1.641210e-124 1.887556e-120
## 2132 3.496915e-107 2.010901e-103
## 570 1.552884e-99 5.953239e-96
## 2889 4.346335e-78 1.249680e-74
## 9234 1.189136e-65 2.735251e-62
## 6265 3.145997e-56 6.030352e-53
Plot log2fold changes against mean normalised counts for untreated vs treated.
plotMA(res, col = ifelse(res$padj >=0.01, "black", "violet"))
abline(h=c(-1:1), col="red")
Select gene names based on FDR (1%)
gene.kept <- res$id[res$padj <= 0.01 & !is.na(res$padj)]
Create a count data set with multiple factors
cdsFull = newCountDataSet(count.table, pasillaDesign)
Estimating size factors
cdsFull = estimateSizeFactors( cdsFull )
Estimating dispersions
cdsFull = estimateDispersions( cdsFull )
Variance stabilizing transformation, which will be useful for certain applications:
cdsBlind = estimateDispersions( cds, method="blind" )
vsd = varianceStabilizingTransformation( cdsBlind )
A heatmap of the count table from the variance stabilisation transformed data for the 20 most highly expressed genes:
cdsFullBlind = estimateDispersions( cdsFull, method = "blind" )
vsdFull = varianceStabilizingTransformation( cdsFullBlind )
select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:20]
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6))
Sample clustering:
distances = dist( t( exprs(vsdFull) ) )
mat = as.matrix( distances )
rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : "))
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
In the case of gene expression data, it’s important to verify that the first principal component is your condition- in this case, treated and untreated (and not technical method of sequencing, like paired or single end reads). PCA plot of the samples:
print(plotPCA(vsdFull, intgroup=c("condition", "libType")))