--- title: "Computer lab, Bioinformatics: Omics data and pathway analysis" author: "Perry Moerland" date: "Thursday, February 25, 2021" urlcolor: blue output: unilur::tutorial_pdf_solution: number_sections: true unilur::tutorial_html_solution: toc: true toc_float: true solution_suffix: "" theme: default number_sections: true unilur::tutorial_html: toc: true toc_float: true theme: default number_sections: true unilur::tutorial_pdf: number_sections: true --- ```{r setup,include=FALSE,purl=FALSE} knitr::opts_template$set(solution = list(box.title = "Answer", box.body = list(fill = "white"), box.collapse = FALSE)) knitr::opts_chunk$set(echo=TRUE,comment="#",fig.path='Figures/Figure-',fig.align='center',fig.width=7,fig.height=7,out.width="60%",fig.show="hold") r <- getOption("repos") r["CRAN"] <- "https://cloud.r-project.org/" options(repos = r) ``` We will continue with the differential expression analysis of the dataset analyzed last time. **If you didn't do so yet**, let us first install the different packages that we'll need. Remember that in order to execute R code from within RStudio, just click the green arrow head in the chunk of code below or put the cursor somewhere in the chunk and select Run - Run Current Chunk from the menu. You can also execute the code line by line using Ctrl-Enter. If you get a message whether packages should be compiled from source or whether old packages should be updated, just type "n" in the console. ```{r install-packages,warning=FALSE,message=FALSE,eval=FALSE} # Installing packages should now be a piece of cake ... # The commented line is needed for the L0 desktops, but can be skipped on other systems #.libPaths("C:/Scratch") install.packages("BiocManager") BiocManager::install(c("Biobase","annotate","arrayQualityMetrics","caret","e1071","genefilter", "hgu133a.db","lattice","limma","NMF","pamr","RColorBrewer")) ``` Now load the libraries so that you can use the functions defined in them: ```{r load-packages,warning=FALSE,message=FALSE} library(annotate) library(arrayQualityMetrics) library(Biobase) library(caret) library(genefilter) library(hgu133a.db) library(lattice) library(limma) library(NMF) ``` Then retrieve the RMA-normalized data for the experiment of ([Winn et al., 2009](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2630905/?report=reader)) that we generated last time, also including the information about the scan dates: ```{r eset,warning=FALSE,message=FALSE} download.file("http://bioinformatics.amc.nl/wp-content/uploads/gs-bioinformatics/PathwaysNetworks/eset.zip", destfile="eset.zip") unzip("eset.zip") load("eset.RData") # Sample annotation: pData(eset) ``` # Differential expression A flexible way of analyzing microarray experiments uses _linear models_ ([Smyth, 2005](http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf)). The idea is that an experiment can be concisely described using matrices and simple linear algebra. A key ingredient of a linear model is the _design matrix_. Each row of the design matrix corresponds to an array in the experiment and each column corresponds to a coefficient which is used to describe the RNA sources in the experiment. In this case the design matrix can be specified as follows: ```{r design,warning=FALSE,message=FALSE} design <- model.matrix(~0+pData(eset)$condition) colnames(design) <- c("preeclampsia","preterm") design ``` The first column of design matrix contains only 1's for the preeclampsia arrays and the second column contains only 1's for the preterm samples. The following command fits the actual linear model: ```{r lmfit,warning=FALSE, message=FALSE} fit <- lmFit(eset,design) ``` The `fit` object you just made contains coefficients. The first coefficient corresponds to the mean log-intensity for the preeclampsia samples and the second coeffcient to the mean log-intensity for the preterm samples: ```{r fit,warning=FALSE,message=FALSE} # Only show this for the first probeset. First the coefficient in the linear # model for the preeclampsia samples fit$coef[1,1] # Then calculate the mean 'by hand' mean(exprs(eset)[1,as.character(pData(eset)$condition)=="preeclampsia"]) # Idem for the preterm samples fit$coef[1,2] mean(exprs(eset)[1,as.character(pData(eset)$condition)=="preterm"]) ``` To find differentially expressed genes, we have to make a _contrast matrix_, which allows the coefficients defined by the design matrix to be combined into contrasts or comparisons of interest. The contrasts are arithmetic combinations of the parameters estimated in the model. The contrast matrix must have a number of rows equal to the number of coefficients in the linear model. Each column in the contrast matrix corresponds to a different contrast of interest where the rows correspond to the parameters estimated by the linear model fit. A contrast,in the contrast matrix, consists of a linear combination of the effects (coefficients) in the linear model. For our dataset we are interested in the difference between preeclampsia and preterm samples as a contrast.Note that since we always work on a log2 scale and because log2(preeclampsia) - log2(preterm) = log2(preeclampsia/preterm) this represents the log-ratio: ```{r contrast,warning=FALSE,message=FALSE} # Specify the comparison of interest cont.matrix <- makeContrasts(PreeclampsiaVsPreterm = preeclampsia - preterm, levels = design) cont.matrix fit2 <- contrasts.fit(fit, cont.matrix) # Indeed this gives the difference of the mean log-intensities (only show this for # the first probeset) fit2$coef[1,1] fit$coef[1,1] - fit$coef[1,2] ``` Above we used the classical t-test for assessing statistical significance. However, since a typical microarray experiment is noisy and the number of arrays per group tends to be small, the standard error (=denominator of t-statistic) is hard to estimate reliably for each gene individually. The consequence is that some genes have small p-values only because, by chance, the denominator of the t-statistic was very small. Several researchers have proposed alternative statistics to obtain a more stable estimate of the gene-specific variance. In the **limma** package this has been implemented via a moderated t-statistic, details are described in ([Smyth, 2005](http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf)). The moderated t-statistic is simply calculated from `fit2` using the function `eBayes`: ```{r toptable,warning=FALSE, message=FALSE} fit2.eb <- eBayes(fit2) # Extract a table from a linear model fit ordered by nominal p-value tt.limma <- topTable(fit2.eb,n=Inf) # Number of differentially expressed probesets with an FDR < 0.05 sum(tt.limma$adj<0.05) # Select the ten most differentially expressed genes tt.limma[1:10,] ``` **Question 19** Compare results with the results obtained using a classical t-test and with GEO2R [results](http://bioinformatics.amc.nl/wp-content/uploads/gs-bioinformatics/OmicsData/GEO2R_GSE14722.xlsx). Can you explain the differences? ```{block answer-q19,solution=TRUE,purl=FALSE} Just looking at the top-10, we observe only very minor differences between the limma results (_tt.limma_) and the [results](http://bioinformatics.amc.nl/wp-content/uploads/gs-bioinformatics/OmicsData/GEO2R_GSE14722.xlsx) from GEO2R. It is actually strange that there are differences at all, since both differential expression analyses are identical. You can inspect the R code used by GEO2R by clicking on the _R script_ tab.You'll notice that it is indeed similar to the code you used above (but more complicated since more generic). However, it turns out that the RMA-normalized data avaiable on GEO as submitted by the authors is slightly different from our RMA-normalized data. See the code below for a comparison of the expression values for probeset 31637_s_at in our data with those listed for the different GSMs for [GSE14722](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14722). This is probably due to small changes in the RMA implementation over the past few years. Comparing the limma results (_tt.limma_) and the results with a classical t-test shows that the log fold-changes are (of course) identical (columns _logFC_ and _dm_) respectively (see code below). However, p-values (and therefore also the corresponding FDRs or adjusted p-values) are generally smaller in the limma results. In general the moderated t-statistics from limma will be more reliable than the classical ones, especially for a relatively small number of samples per group as is the case here. ``` ```{r exprs,solution=TRUE,box.title="R code",warning=FALSE,message=FALSE} # Our expression values for 31637_s_at exprs(eset)["31637_s_at",] ``` ```{r rowttests,solution=TRUE,box.title="R code",warning=FALSE,message=FALSE} # Results for a classical t-test tt <- rowttests(exprs(eset),as.factor(pData(eset)$condition)) # Use 'match' to have the probesets in the same order plot(tt.limma$P.Value,tt$p.value[match(rownames(tt.limma),rownames(tt))],xlim=c(0,0.1), ylim=c(0,0.1),cex=0.4) abline(0,1,col="red") ``` *** Let us now try to take the batch effects observed between the different scan dates into account and see what we gain by doing so. In a linear modeling framework this is easily done by extending our previous linear model with the scan dates as a _blocking_ variable: ```{r scan-dates,warning=FALSE, message=FALSE} # Switch order from (preeclampsia,preterm) to (preterm,preeclampsia) pData(eset)$condition <- relevel(pData(eset)$condition, "preterm") # Include scan date as a blocking variable design <- model.matrix(~pData(eset)$condition + pData(eset)$scandate) # This is a more complicated design than the previous one. I will explain this some time # this afternoon head(design,n=4) tail(design,n=4) fit <- lmFit(eset,design) ``` We can now estimate differential expression between preeclamptic and preterm samples corrected for batch effects: ```{r diff-exp,warning=FALSE, message=FALSE} fit.eb <- eBayes(fit) # The second coefficient now corresponds to the comparison of interest tt.pvsp <- topTable(fit.eb,coef=2,n=Inf) # Number of differentially expressed probesets with an FDR < 0.05 sum(tt.pvsp$adj<0.05) # Select the ten most differentially expressed genes tt.pvsp[1:10,] ``` **Question 20** Did we gain anything by correcting for the batch effect? ```{block answer-q20,solution=TRUE,purl=FALSE} Yes, the number of genes differentially expressed with an adjusted p-value < 0.05 almost doubled from 33 to 63. ``` *** The most differentially expressed gene is 203140_at (BCL6, we'll come back to annotating the probesets below). Let's visualize its expression values: ```{r stripplot,fig.width=12} # Create a panel plot stripplot(exprs(eset)["203140_at",] ~ pData(eset)$condition | factor(pData(eset)$scandate), layout = c(5, 1)) ``` We can use the same model to estimate differential expression between the five different batches: ```{r batch-effect,warning=FALSE,message=FALSE} # Coefficients 3-6 correspond to the pairwise comparisons of batch '02/11/05' with the # other batches tt.batch <- topTable(fit.eb,coef=3:6,n=Inf) sum(tt.batch$adj<0.05) # Select the ten most differentially expressed genes tt.batch[1:10,] ``` Differences between batches are impressive indeed! Note that many of the top hits are hybridization controls. Let's visualize the expression values of one of the genes most affected by the batch effect (213350_at: RPS11): ```{r stripplot-2,fig.width=12} # Create a panel plot stripplot(exprs(eset)["213350_at",] ~ pData(eset)$condition | factor(pData(eset)$scandate), layout = c(5, 1)) ``` Batch corrected expression values corresponding to the linear model-based batch correction above can be generated using the function `removeBatchEffect`: ```{r remove-batch,warning=FALSE,message=FALSE} eset.corrected <- eset exprs(eset.corrected) <- removeBatchEffect(exprs(eset), batch=factor(pData(eset)$scandate), design=model.matrix(~pData(eset)$condition)) ``` We regenerate an arrayQualityMetrics report but now on the batch corrected data: ```{r aqm,warning=FALSE, message=FALSE} if (!dir.exists("aQ_corrected")){ arrayQualityMetrics(eset.corrected,outdir="aQ_corrected", intgroup=c("condition","scandate"),force=TRUE) } ``` **Question 21** Do you still observe batch effects and outliers after the correction? What could be a reason of the large remaining within-group heterogeneity (think about the way the study was set up). ```{block answer-q21,solution=TRUE,purl=FALSE} No outliers are detected and no clear batch effects can be seen. Part of the heterogeneity might be explained by the large variation in the number of weeks at which the placenta was obtained (24-36 wk in both groups). You might actually want to include this in your linear model as another covariate in a full-fledged analysis. The information is available at [GSE14722](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14722). ``` Take home message: batch effects are pervasive and their effect is often blissfully ignored or underestimated. See [Leek et al., 2010](http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html) for a good review, illustrated with more examples.See also the lively [discussion](http://www.nature.com/news/potential-flaws-in-genomics-paper-scrutinized-on-twitter-1.17591) on a recent comparison of the transcriptional landscapes between human and mouse tissues. If you want to learn more about linear models, see the [Biomedical Data Science](http://genomicsclass.github.io/book/) course of Rafael Irizarry et al. *** The tables that we generated above only contained probeset identifiers. In general you would like to add gene annotation to these tables. For Affymetrix arrays this is easily done as follows: ```{r annotation,warning=FALSE,message=FALSE} probesets <- as.character(rownames(tt.pvsp)) # Retrieve Entrez Gene IDs for the U133A chip eg <- getEG(probesets, "hgu133a") # Retrieve gene symbols for the U133A chip sym <- getSYMBOL(probesets, "hgu133a") tt.pvsp$ID <- as.character(rownames(tt.pvsp)) # Generate an HTML table with annotation htmlpage(as.data.frame(eg), filename = "report.html", title = "HTML report", othernames = data.frame(sym, tt.pvsp), table.head = c("GeneID","Symbol", colnames(tt.pvsp)), table.center = TRUE) ``` Also save the results in a tab-delmited text file: ```{r write-table,warning=FALSE,message=FALSE} write.table(data.frame(GeneID=eg,Symbol=sym,tt.pvsp),file="tt_GSE14722.txt",sep="\t", row.names=FALSE) ``` # Geneset enrichment analysis **Question 22** Use the [DAVID Functional Annotation Tool](http://david.abcc.ncifcrf.gov/summary.jsp) to find genesets that are enriched for the genes probed in this experiment. In the lecture three types of gene set analyses approaches were described which of the approaches does DAVID use? Things to keep in mind when using DAVID: * What selection of genes do you use? * Think about using an appropriate background set. * Which identifier to use? Investigate the KEGG pathway results and the tissue expression results in particular. ```{block answer-q22,solution=TRUE,purl=FALSE} DAVID takes a cut-off based overrepresentation approach (ORA). I used a selection based on those probesets with a nominal P-value <0.01 in the limma analysis when taking the batch effect into account; see the file tt_GSE14722.txt that you generated above. In general, for an ORA-type analysis the list of selected genes should not be too short (<100 genes) nor too long (>1000 genes). Here, the selection criterion leads to a list of 674 probesets. An ORA-type analysis using DAVID then consists of three steps, illustrated in the workflow below: * Paste the selected identfiers under 'Upload'. Here I decided to use Affymetrix probeset IDs and therefore selected 'AFFYMETRIX_3PRIME_IVT_ID' and indicated that the list type is 'Gene List'; * Then select 'Homo Sapiens' as species under 'List'; * Finally, under 'Background' select the platform used in the experiment, that is 'Human Genome U133A Array'. Results for various geneset categories are indicated in the 'Annotation Summary Results'. Here results for the categories 'Tissue_Expression' and 'Pathways' are unfolded. For the 'Tissue_Expression' results it is reassuring to see that the signature is enriched for placenta-specific genes (UP_TISSUE). One of the most enriched KEGG pathways is [_Alanine, aspartate and glutamate metabolism_](http://www.genome.jp/kegg-bin/show_pathway?hsa00250), which has indeed been implied in preeclampsia before (see [Pennington et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3255538/)). ``` ```{r include-graphics,solution=TRUE,box.title="Workflow",echo=FALSE,purl=FALSE} knitr::include_graphics("DAVID.jpg") ``` Let us now have a closer look at the alanine, aspartate and glutamate metabolism pathway: ```{r aheatmap,warning=FALSE,message=FALSE} # Select the related genes (KEGG ID: 00250) hsa00250 <- AnnotationDbi::select(hgu133a.db, keys = "00250", keytype = "PATH", columns = c("PROBEID", "ENTREZID", "SYMBOL", "GENENAME")) # Extract and scale the expression data E <- exprs(eset[hsa00250$PROBEID, ]) E <- t(scale(t(E), center = TRUE, scale = FALSE)) # Visualize the expression of genes related to alanine, aspartate and glutamate # metabolism in a heatmap aheatmap(E, dist = "pearson", col = "-RdBu:256", breaks = 0, labCol = pData(eset)$condition) ``` Restricting clustering to this pathway is still far from perfect, since many genes in the pathway are not differentially expressed. We could also do a DAVID-type enrichment analysis in R by just constructing the 2x2 contingency table: ```{r fisher-exact,warning=FALSE,message=FALSE} # Define DE genes based on thresholds diffExpr <- decideTests(fit.eb, p.value = 0.01, adjust.method = "none") # Number of DE genes per coefficient summary(diffExpr) # Select the comparison of interest: preeclampsia vs.preterm diffExpr <- diffExpr[,2] diffExpr <- diffExpr != 0 # Indicate members of the alanine, aspartate and glutamate metabolism pathway inPathway <- featureNames(eset) %in% hsa00250$PROBEID # Construct contingency table tab <- table(diffExpr, inPathway) tab # Perform an enrichment test using Fisher's exact test fisher.test(tab, alternative = "greater") ``` Indeed, the signature is clearly enriched for genes part of the alanine, aspartate and glutamate metabolism pathway. **Question 23** If there is still time left, you can work with two other enrichment tools: * [Graphite Web](https://graphiteweb.bio.unipd.it/analyze.html) * [STRING](http://string-db.org/) # Classification We will now have a brief look at how to construct a gene expression-based classifier to predict patient status (preeclampsia/preterm). The **caret** (short for Classification And REgression Training) [package](http://topepo.github.io/caret/index.html) offers a versatile set of functions for this purpose. The typical steps in building a classifier are: First, split the data into a training and test set: ```{r split,warning=FALSE,message=FALSE} # Fix the random seed to make the exercise reproducible set.seed(13) # training set: 75%, test set: 25% inTrain <- createDataPartition(y = pData(eset)$condition,p = .75,list = FALSE) training <- eset[ ,inTrain] testing <- eset[,-inTrain] ncol(training) ncol(testing) ``` Second, fit a model using the training data (both the gene expression data and the class labels). Here, we use the _nearest shrunken centroids_ model also refered to as PAM (Prediction Analysis for Microarrays). This is a simple linear model, proposed in 2002 by Rob Tibshirani ([paper](http://www.pnas.org/content/99/10/6567.full)). While fitting the model, this approach also identifies subsets of genes that best characterize each class. ```{r pamr,warning=FALSE,message=FALSE} pamrFit <- train(x=t(exprs(training)), y=pData(training)$condition, method="pam", tuneLength=10,trControl = trainControl(method = "cv")) pamrFit ``` The optimal number of features is determined by a threshold value and the optimal threshold value is selected using cross-validation. Third the optimal model is used to predict the class labels for the test data: ```{r confusion-matrix,warning=FALSE,message=FALSE} pamrClasses <- predict(pamrFit, newdata = t(exprs(testing))) confusionMatrix(data = pamrClasses, pData(testing)$condition) ``` In this case, this leads to a model that makes one error on the five test samples. Of course, one should split the data into a training and a test set multiple times to obtain more reliable estimates of the accuracy of this model. Finally, let us have a look at the most important variables in the model: ```{r feature-selection,warning=FALSE,message=FALSE} varimps <- varImp(pamrFit) getSYMBOL(head(featureNames(training)[order(varimps$importance[,1],decreasing=TRUE)]),"hgu133a") ``` Not surprisingly, the most important variables are those that are strongly differentially expressed.