In these computer exercises you will mainly use the statistical software package R. Since we will focus on the interpretation of the results, no previous exposure to R is required. If you want to learn more about R, see our biannual AMC Graduate School course Computing in R.

The goal of this computer lab is to give you an overview of the techniques typically applied when analyzing omics data:

Unsupervised methods: clustering
Quality control and normalization
Differential expression analysis: statistical tests, multiple testing

First download the Rmd (R Markdown) file and open it in RStudio (Alle programma’s - R - RStudio, and ‘Ignore Update’). If you have not done so yet, first install the packages that we will need. To execute R code from within RStudio, click the green arrowhead in the chunk of code shown below, or put the cursor somewhere in the chunk and select Run - Run Current Chunk from the menu. You can also execute code line by line using Ctrl-Enter:

# If you have not done so yet, first install the required packages. The commented line
# is needed for the L0 desktops, but can be skipped on other systems
#.libPaths("C:/Scratch")
# Installation of packages might take a few minutes
# If the console asks "Update all/some/none? [a/s/n]:", just reply "n"
install.packages("BiocManager")
BiocManager::install(c("affy","arrayQualityMetrics","bioDist","genefilter","GenomeInfoDbData",
                       "hgu133acdf","limma","tibble","mclust","ClassDiscovery"))

Now load the libraries so that you can use the functions defined in them:

library(affy)
library(arrayQualityMetrics)
library(bioDist)
library(ClassDiscovery)
library(genefilter)
library(limma)
library(mclust)
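
If one of the library() calls above fails, a package is probably not installed yet. The following is a minimal sketch, not part of the original lab, for listing which of the packages are still missing. The variable names pkgs, available and missing.pkgs are chosen here for illustration only.

```r
# Packages loaded in this lab (same names as in the library() calls above)
pkgs <- c("affy", "arrayQualityMetrics", "bioDist", "ClassDiscovery",
          "genefilter", "limma", "mclust")

# requireNamespace() returns FALSE instead of stopping with an error
# when a package is not installed
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (empty if none)
missing.pkgs <- pkgs[!available]
missing.pkgs
```

Any package listed in missing.pkgs can then be installed with BiocManager::install() as shown earlier.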

1 Unsupervised methods

Unsupervised learning methods aim at detecting structures in data. The term unsupervised refers to the fact that these methods do not use gene or sample annotations; only the (normalized) gene expression values are used. A primary purpose of such methods is to group similar data together (clustering) and to provide a visualization of the data in which structures can easily be recognized. These may be relations among genes, among samples, or even between genes and samples. The discovery of such structures can lead to the development of new hypotheses. For example, a group of genes with similar expression profiles may indicate that these genes are co-regulated and possibly involved in the same biological process. If one looks at samples instead of genes, the separation of expression profiles of patient tissue samples may point to a possible refinement of disease taxonomy. On the other hand, unsupervised methods are often used to confirm known differences between genes or samples. If a clustering algorithm groups samples from two different tumor types into distinct clusters, this provides evidence that the tumor types indeed show clearly detectable differences in their global expression profiles.

As you might have noticed, the word similar was used several times above. Similarity is a central concept in unsupervised learning, be it clustering or visualization. In clustering one wants to group similar objects together, whereas in visualization one wants to find a representation of a high-dimensional data set in two or three dimensions while losing as little information as possible: objects that are similar in the original data space should also be similar in the low-dimensional space.

Assume that we performed a mini-experiment with four samples (A,B,C,D) and four genes per sample. The resulting data set has been checked for low-quality samples and has been properly normalized. The resulting log-ratios are given in file hcexample.txt. Download this file and save it in the same folder as the current Rmd file.

The following piece of R code reads in hcexample.txt and then plots the sample profiles:

E <- read.table("hcexample.txt")
matplot(E,type="l",col=1:4,lty=1:4,lwd=3,xlab="Gene",ylab="log2-ratio",xaxt="n")
axis(1,1:4)
legend(3.3,2.8,c("A","B","C","D"),lty=1:4,col=1:4,lwd=3,y.intersp=1,cex=1.3)

Cluster algorithms group similar data together. What is meant by the word similar is formally defined by the notion of a distance. In R, the bioDist package gives a collection of functions for calculating distance measures. We will have a look at two of them in more detail.

Calculate the Euclidean distance (euc) and the Pearson correlation distance (cor.dist) between the sample profiles.

# Note that you have to transpose (t) the data matrix E since pairwise distances 
# are calculated for rows of a matrix
d.euc <- euc(t(E))
d.euc

#          A        B        C
# B  2.00000                  
# C 10.00000 10.19804         
# D 10.19804 10.00000  2.00000

d.cor <- cor.dist(t(E),abs=FALSE)
d.cor

#   A B C
# B 2    
# C 0 2  
# D 2 0 2

Note that you can always obtain a detailed explanation of a function by typing ? followed by the name of the function in the Console window, for example ?euc or ?cor.dist.
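
To get a feeling for how the two distance measures differ, both can be reproduced with base R on a small made-up matrix. This is a sketch under two assumptions: the matrix toy and its columns P, Q and R are hypothetical and are not the contents of hcexample.txt, and cor.dist with abs=FALSE is assumed to compute one minus the Pearson correlation, which is consistent with the values 0 and 2 in the output above (check ?cor.dist to confirm).

```r
# Hypothetical toy profiles (NOT the contents of hcexample.txt):
# rows are genes, columns are samples, as in the matrix E above
toy <- cbind(P = c(1, 2, 3, 4),
             Q = c(11, 12, 13, 14),  # P shifted up by 10: same shape
             R = c(4, 3, 2, 1))      # P reversed: opposite shape

# Euclidean distance between samples P and Q, computed by hand
euc.PQ <- sqrt(sum((toy[, "P"] - toy[, "Q"])^2))

# dist() calculates distances between rows, hence the transpose
d.euc.toy <- dist(t(toy), method = "euclidean")

# Pearson correlation distance, assumed here to be 1 - r;
# cor() correlates columns, so no transpose is needed
d.cor.toy <- as.dist(1 - cor(toy))

euc.PQ
d.euc.toy
d.cor.toy
```

In this toy example P and Q are far apart in Euclidean terms, yet their correlation distance is zero (up to rounding) because the two profiles have exactly the same shape. P and R are much closer in Euclidean terms but have opposite shapes, which gives the largest possible correlation distance.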

Question 1 Can you explain the resulting distance matrix when using the Pearson correlation distance?

Computer lab, Bioinformatics: Omics data analysis

Perry Moerland

Tuesday, February 23, 2021
