---
title: "Computer lab, Bioinformatics: Omics data and pathway analysis"
author: "Perry Moerland"
date: "Thursday, February 25, 2021"
urlcolor: blue
output:
unilur::tutorial_pdf_solution:
number_sections: true
unilur::tutorial_html_solution:
toc: true
toc_float: true
solution_suffix: ""
theme: default
number_sections: true
unilur::tutorial_html:
toc: true
toc_float: true
theme: default
number_sections: true
unilur::tutorial_pdf:
number_sections: true
---
```{r setup,include=FALSE,purl=FALSE}
knitr::opts_template$set(solution = list(box.title = "Answer",
box.body = list(fill = "white"),
box.collapse = FALSE))
knitr::opts_chunk$set(echo=TRUE,comment="#",fig.path='Figures/Figure-',fig.align='center',fig.width=7,fig.height=7,out.width="60%",fig.show="hold")
r <- getOption("repos")
r["CRAN"] <- "https://cloud.r-project.org/"
options(repos = r)
```
We will continue with the differential expression analysis of the dataset analyzed
last time. **If you didn't do so yet**, let us first install the different packages that we'll need. Remember
that in order to execute R code from within RStudio, just click the green arrow
head in the chunk of code below or put the cursor somewhere in the chunk and
select Run - Run Current Chunk from the menu. You can also execute the code line
by line using Ctrl-Enter. If you get a message whether packages should be compiled
from source or whether old packages should be updated, just type "n" in the console.
```{r install-packages,warning=FALSE,message=FALSE,eval=FALSE}
# Installing packages should now be a piece of cake ...
# The commented line is needed for the L0 desktops, but can be skipped on other systems
#.libPaths("C:/Scratch")
install.packages("BiocManager")
BiocManager::install(c("Biobase","annotate","arrayQualityMetrics","caret","e1071","genefilter",
"hgu133a.db","lattice","limma","NMF","pamr","RColorBrewer"))
```
Now load the libraries so that you can use the functions defined in them:
```{r load-packages,warning=FALSE,message=FALSE}
library(annotate)
library(arrayQualityMetrics)
library(Biobase)
library(caret)
library(genefilter)
library(hgu133a.db)
library(lattice)
library(limma)
library(NMF)
```
Then retrieve the RMA-normalized data for the experiment of
([Winn et al., 2009](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2630905/?report=reader))
that we generated last time, also including the information about the scan
dates:
```{r eset,warning=FALSE,message=FALSE}
download.file("http://bioinformatics.amc.nl/wp-content/uploads/gs-bioinformatics/PathwaysNetworks/eset.zip", destfile="eset.zip")
unzip("eset.zip")
load("eset.RData")
# Sample annotation:
pData(eset)
```
# Differential expression
A flexible way of analyzing microarray experiments uses _linear models_
([Smyth, 2005](http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf)).
The idea is that an experiment can be concisely described using matrices and
simple linear algebra. A key ingredient of a linear model is the
_design matrix_. Each row of the design matrix corresponds to an array in the
experiment and each column corresponds to a coefficient which is used to
describe the RNA sources in the experiment. In this case the design matrix can
be specified as follows:
```{r design,warning=FALSE,message=FALSE}
design <- model.matrix(~0+pData(eset)$condition)
colnames(design) <- c("preeclampsia","preterm")
design
```
The first column of design matrix contains only 1's for the preeclampsia
arrays and the second column contains only 1's for the preterm samples.
The following command fits the actual linear model:
```{r lmfit,warning=FALSE, message=FALSE}
fit <- lmFit(eset,design)
```
The `fit` object you just made contains coefficients. The first coefficient
corresponds to the mean log-intensity for the preeclampsia samples and the
second coeffcient to the mean log-intensity for the preterm samples:
```{r fit,warning=FALSE,message=FALSE}
# Only show this for the first probeset. First the coefficient in the linear
# model for the preeclampsia samples
fit$coef[1,1]
# Then calculate the mean 'by hand'
mean(exprs(eset)[1,as.character(pData(eset)$condition)=="preeclampsia"])
# Idem for the preterm samples
fit$coef[1,2]
mean(exprs(eset)[1,as.character(pData(eset)$condition)=="preterm"])
```
To find differentially expressed genes, we have to make a _contrast matrix_,
which allows the coefficients defined by the design matrix to be combined into
contrasts or comparisons of interest. The contrasts are arithmetic combinations of the
parameters estimated in the model. The contrast matrix must have a number of
rows equal to the number of coefficients in the linear model. Each column in the
contrast matrix corresponds to a different contrast of interest where the rows
correspond to the parameters estimated by the linear model fit. A contrast,in
the contrast matrix, consists of a linear combination of the effects
(coefficients) in the linear model. For our dataset we are interested in the
difference between preeclampsia and preterm samples as a contrast.Note that
since we always work on a log2 scale and because
log2(preeclampsia) - log2(preterm) = log2(preeclampsia/preterm)
this represents the log-ratio:
```{r contrast,warning=FALSE,message=FALSE}
# Specify the comparison of interest
cont.matrix <- makeContrasts(PreeclampsiaVsPreterm = preeclampsia - preterm, levels = design)
cont.matrix
fit2 <- contrasts.fit(fit, cont.matrix)
# Indeed this gives the difference of the mean log-intensities (only show this for
# the first probeset)
fit2$coef[1,1]
fit$coef[1,1] - fit$coef[1,2]
```
Above we used the classical t-test for assessing statistical significance.
However, since a typical microarray experiment is noisy and the number of arrays
per group tends to be small, the standard error (=denominator of t-statistic) is
hard to estimate reliably for each gene individually. The consequence is that some genes have small
p-values only because, by chance, the denominator of the t-statistic was very
small. Several researchers have proposed alternative statistics to obtain a
more stable estimate of the gene-specific variance. In the **limma** package this
has been implemented via a moderated t-statistic, details are described in
([Smyth, 2005](http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf)).
The moderated t-statistic is simply calculated from `fit2` using the function
`eBayes`:
```{r toptable,warning=FALSE, message=FALSE}
fit2.eb <- eBayes(fit2)
# Extract a table from a linear model fit ordered by nominal p-value
tt.limma <- topTable(fit2.eb,n=Inf)
# Number of differentially expressed probesets with an FDR < 0.05
sum(tt.limma$adj<0.05)
# Select the ten most differentially expressed genes
tt.limma[1:10,]
```
**Question 19**
Compare results with the results obtained using a classical t-test and with
GEO2R [results](http://bioinformatics.amc.nl/wp-content/uploads/gs-bioinformatics/OmicsData/GEO2R_GSE14722.xlsx). Can you explain the differences?
```{block answer-q19,solution=TRUE,purl=FALSE}
Just looking at the top-10, we observe only very minor differences between the
limma results (_tt.limma_) and the
[results](http://bioinformatics.amc.nl/wp-content/uploads/gs-bioinformatics/OmicsData/GEO2R_GSE14722.xlsx)
from GEO2R. It is actually strange that there are differences at all, since
both differential expression analyses are identical. You can inspect the R code
used by GEO2R by clicking on the _R script_ tab.You'll notice that it is indeed
similar to the code you used above (but more complicated since more generic).
However, it turns out that the RMA-normalized data avaiable on GEO as submitted
by the authors is slightly different from our RMA-normalized data. See the code
below for a comparison of the expression values for probeset 31637_s_at in our data with those listed for
the different GSMs for [GSE14722](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14722).
This is probably due to small changes in the RMA implementation over the past
few years.
Comparing the limma results (_tt.limma_) and the results with a classical
t-test shows that the log fold-changes are (of course) identical (columns
_logFC_ and _dm_) respectively (see code below). However, p-values (and therefore also the
corresponding FDRs or adjusted p-values) are generally smaller in the limma
results. In general the moderated t-statistics from limma will be more reliable than the
classical ones, especially for a relatively small number of samples per group as
is the case here.
```
```{r exprs,solution=TRUE,box.title="R code",warning=FALSE,message=FALSE}
# Our expression values for 31637_s_at
exprs(eset)["31637_s_at",]
```
```{r rowttests,solution=TRUE,box.title="R code",warning=FALSE,message=FALSE}
# Results for a classical t-test
tt <- rowttests(exprs(eset),as.factor(pData(eset)$condition))
# Use 'match' to have the probesets in the same order
plot(tt.limma$P.Value,tt$p.value[match(rownames(tt.limma),rownames(tt))],xlim=c(0,0.1),
ylim=c(0,0.1),cex=0.4)
abline(0,1,col="red")
```
***
Let us now try to take the batch effects observed between the different scan
dates into account and see what we gain by doing so. In a linear modeling
framework this is easily done by extending our previous linear model with the
scan dates as a _blocking_ variable:
```{r scan-dates,warning=FALSE, message=FALSE}
# Switch order from (preeclampsia,preterm) to (preterm,preeclampsia)
pData(eset)$condition <- relevel(pData(eset)$condition, "preterm")
# Include scan date as a blocking variable
design <- model.matrix(~pData(eset)$condition + pData(eset)$scandate)
# This is a more complicated design than the previous one. I will explain this some time
# this afternoon
head(design,n=4)
tail(design,n=4)
fit <- lmFit(eset,design)
```
We can now estimate differential expression between preeclamptic and preterm
samples corrected for batch effects:
```{r diff-exp,warning=FALSE, message=FALSE}
fit.eb <- eBayes(fit)
# The second coefficient now corresponds to the comparison of interest
tt.pvsp <- topTable(fit.eb,coef=2,n=Inf)
# Number of differentially expressed probesets with an FDR < 0.05
sum(tt.pvsp$adj<0.05)
# Select the ten most differentially expressed genes
tt.pvsp[1:10,]
```
**Question 20**
Did we gain anything by correcting for the batch effect?
```{block answer-q20,solution=TRUE,purl=FALSE}
Yes, the number of genes differentially expressed with an adjusted p-value <
0.05 almost doubled from 33 to 63.
```
***
The most differentially expressed gene is 203140_at (BCL6, we'll come back to annotating
the probesets below). Let's visualize its expression values:
```{r stripplot,fig.width=12}
# Create a panel plot
stripplot(exprs(eset)["203140_at",] ~ pData(eset)$condition |
factor(pData(eset)$scandate), layout = c(5, 1))
```
We can use the same model to estimate differential expression between the five
different batches:
```{r batch-effect,warning=FALSE,message=FALSE}
# Coefficients 3-6 correspond to the pairwise comparisons of batch '02/11/05' with the
# other batches
tt.batch <- topTable(fit.eb,coef=3:6,n=Inf)
sum(tt.batch$adj<0.05)
# Select the ten most differentially expressed genes
tt.batch[1:10,]
```
Differences between batches are impressive indeed! Note that many of the top
hits are hybridization controls. Let's visualize the expression values of one of
the genes most affected by the batch effect (213350_at: RPS11):
```{r stripplot-2,fig.width=12}
# Create a panel plot
stripplot(exprs(eset)["213350_at",] ~ pData(eset)$condition |
factor(pData(eset)$scandate), layout = c(5, 1))
```
Batch corrected expression values corresponding to the linear model-based batch
correction above can be generated using the function `removeBatchEffect`:
```{r remove-batch,warning=FALSE,message=FALSE}
eset.corrected <- eset
exprs(eset.corrected) <- removeBatchEffect(exprs(eset), batch=factor(pData(eset)$scandate),
design=model.matrix(~pData(eset)$condition))
```
We regenerate an arrayQualityMetrics report but now on the batch corrected
data:
```{r aqm,warning=FALSE, message=FALSE}
if (!dir.exists("aQ_corrected")){
arrayQualityMetrics(eset.corrected,outdir="aQ_corrected",
intgroup=c("condition","scandate"),force=TRUE)
}
```
**Question 21**
Do you still observe batch effects and outliers after the correction? What
could be a reason of the large remaining within-group heterogeneity (think about
the way the study was set up).
```{block answer-q21,solution=TRUE,purl=FALSE}
No outliers are detected and no clear batch effects can be seen. Part of the
heterogeneity might be explained by the large variation in the number of weeks
at which the placenta was obtained (24-36 wk in both groups). You might
actually want to include this in your linear model as another covariate in a
full-fledged analysis. The information is available at
[GSE14722](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14722).
```
Take home message: batch effects are pervasive and their effect is often
blissfully ignored or underestimated. See
[Leek et al., 2010](http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html)
for a good review, illustrated with more examples.See also the lively
[discussion](http://www.nature.com/news/potential-flaws-in-genomics-paper-scrutinized-on-twitter-1.17591)
on a recent comparison of the transcriptional landscapes between human and mouse tissues.
If you want to learn more about linear models, see the [Biomedical Data Science](http://genomicsclass.github.io/book/)
course of Rafael Irizarry et al.
***
The tables that we generated above only contained probeset identifiers. In
general you would like to add gene annotation to these tables. For
Affymetrix arrays this is easily done as follows:
```{r annotation,warning=FALSE,message=FALSE}
probesets <- as.character(rownames(tt.pvsp))
# Retrieve Entrez Gene IDs for the U133A chip
eg <- getEG(probesets, "hgu133a")
# Retrieve gene symbols for the U133A chip
sym <- getSYMBOL(probesets, "hgu133a")
tt.pvsp$ID <- as.character(rownames(tt.pvsp))
# Generate an HTML table with annotation
htmlpage(as.data.frame(eg), filename = "report.html", title = "HTML report",
othernames = data.frame(sym, tt.pvsp),
table.head = c("GeneID","Symbol", colnames(tt.pvsp)),
table.center = TRUE)
```
Also save the results in a tab-delmited text file:
```{r write-table,warning=FALSE,message=FALSE}
write.table(data.frame(GeneID=eg,Symbol=sym,tt.pvsp),file="tt_GSE14722.txt",sep="\t",
row.names=FALSE)
```
# Geneset enrichment analysis
**Question 22**
Use the
[DAVID Functional Annotation Tool](http://david.abcc.ncifcrf.gov/summary.jsp)
to find genesets that are enriched for the genes probed in this experiment.
In the lecture three types of gene set analyses approaches were described which
of the approaches does DAVID use? Things to keep in mind when using DAVID:
* What selection of genes do you use?
* Think about using an appropriate background set.
* Which identifier to use?
Investigate the KEGG pathway results and the tissue expression results in particular.
```{block answer-q22,solution=TRUE,purl=FALSE}
DAVID takes a cut-off based overrepresentation approach (ORA). I used a selection based on
those probesets with a nominal P-value <0.01 in the limma analysis when taking the batch effect into
account; see the file tt_GSE14722.txt that you generated above. In general, for
an ORA-type analysis the list of selected genes should not be too short (<100 genes)
nor too long (>1000 genes). Here, the selection criterion leads to a list of
674 probesets. An ORA-type analysis using DAVID then consists of three steps,
illustrated in the workflow below:
* Paste the selected identfiers under 'Upload'. Here I decided to use
Affymetrix probeset IDs and therefore selected 'AFFYMETRIX_3PRIME_IVT_ID' and
indicated that the list type is 'Gene List';
* Then select 'Homo Sapiens' as species under 'List';
* Finally, under 'Background' select the platform used in the experiment, that is
'Human Genome U133A Array'.
Results for various geneset categories are indicated in the 'Annotation Summary
Results'. Here results for the categories 'Tissue_Expression' and 'Pathways' are
unfolded. For the 'Tissue_Expression' results it is reassuring to see that the signature
is enriched for placenta-specific genes (UP_TISSUE). One of the most enriched KEGG pathways
is [_Alanine, aspartate and glutamate metabolism_](http://www.genome.jp/kegg-bin/show_pathway?hsa00250), which has indeed been implied in preeclampsia
before (see [Pennington et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3255538/)).
```
```{r include-graphics,solution=TRUE,box.title="Workflow",echo=FALSE,purl=FALSE}
knitr::include_graphics("DAVID.jpg")
```
Let us now have a closer look at the alanine, aspartate and glutamate metabolism pathway:
```{r aheatmap,warning=FALSE,message=FALSE}
# Select the related genes (KEGG ID: 00250)
hsa00250 <- AnnotationDbi::select(hgu133a.db, keys = "00250", keytype = "PATH",
columns = c("PROBEID", "ENTREZID", "SYMBOL", "GENENAME"))
# Extract and scale the expression data
E <- exprs(eset[hsa00250$PROBEID, ])
E <- t(scale(t(E), center = TRUE, scale = FALSE))
# Visualize the expression of genes related to alanine, aspartate and glutamate
# metabolism in a heatmap
aheatmap(E, dist = "pearson", col = "-RdBu:256", breaks = 0, labCol = pData(eset)$condition)
```
Restricting clustering to this pathway is still far from perfect, since many genes
in the pathway are not differentially expressed.
We could also do a DAVID-type enrichment analysis in R by just constructing the
2x2 contingency table:
```{r fisher-exact,warning=FALSE,message=FALSE}
# Define DE genes based on thresholds
diffExpr <- decideTests(fit.eb, p.value = 0.01, adjust.method = "none")
# Number of DE genes per coefficient
summary(diffExpr)
# Select the comparison of interest: preeclampsia vs.preterm
diffExpr <- diffExpr[,2]
diffExpr <- diffExpr != 0
# Indicate members of the alanine, aspartate and glutamate metabolism pathway
inPathway <- featureNames(eset) %in% hsa00250$PROBEID
# Construct contingency table
tab <- table(diffExpr, inPathway)
tab
# Perform an enrichment test using Fisher's exact test
fisher.test(tab, alternative = "greater")
```
Indeed, the signature is clearly enriched for genes part of the alanine, aspartate and glutamate metabolism pathway.
**Question 23**
If there is still time left, you can work with two other enrichment tools:
* [Graphite Web](https://graphiteweb.bio.unipd.it/analyze.html)
* [STRING](http://string-db.org/)
# Classification
We will now have a brief look at how to construct a gene expression-based classifier
to predict patient status (preeclampsia/preterm). The **caret** (short for
Classification And REgression Training) [package](http://topepo.github.io/caret/index.html)
offers a versatile set of functions for this purpose. The typical steps in building
a classifier are:
First, split the data into a training and test set:
```{r split,warning=FALSE,message=FALSE}
# Fix the random seed to make the exercise reproducible
set.seed(13)
# training set: 75%, test set: 25%
inTrain <- createDataPartition(y = pData(eset)$condition,p = .75,list = FALSE)
training <- eset[ ,inTrain]
testing <- eset[,-inTrain]
ncol(training)
ncol(testing)
```
Second, fit a model using the training data (both the gene expression data and
the class labels). Here, we use the _nearest shrunken centroids_ model also
refered to as PAM (Prediction Analysis for Microarrays). This is a simple
linear model, proposed in 2002 by Rob Tibshirani ([paper](http://www.pnas.org/content/99/10/6567.full)).
While fitting the model, this approach also identifies subsets of genes that
best characterize each class.
```{r pamr,warning=FALSE,message=FALSE}
pamrFit <- train(x=t(exprs(training)), y=pData(training)$condition, method="pam",
tuneLength=10,trControl = trainControl(method = "cv"))
pamrFit
```
The optimal number of features is determined by a threshold value and the optimal
threshold value is selected using cross-validation.
Third the optimal model is used to predict the class labels for the test data:
```{r confusion-matrix,warning=FALSE,message=FALSE}
pamrClasses <- predict(pamrFit, newdata = t(exprs(testing)))
confusionMatrix(data = pamrClasses, pData(testing)$condition)
```
In this case, this leads to a model that makes one error on the five test samples. Of
course, one should split the data into a training and a test set multiple times
to obtain more reliable estimates of the accuracy of this model.
Finally, let us have a look at the most important variables in the model:
```{r feature-selection,warning=FALSE,message=FALSE}
varimps <- varImp(pamrFit)
getSYMBOL(head(featureNames(training)[order(varimps$importance[,1],decreasing=TRUE)]),"hgu133a")
```
Not surprisingly, the most important variables are those that are strongly
differentially expressed.