The objective of these computer exercises is to become familiar with the basic structure of R and to learn some more sophisticated tips and tricks of R. The R code to perform the analyses, together with the output, is given in a separate file. However, you are strongly advised to write the code yourself. Questions are posed to reflect on what you are doing and to study output and results in more detail.

1 Basics

If you have not done so already, start R via Starten - Alle programma’s - R - R x64 3.4.4. This opens the R console. R is a command line driven environment. This means that you have to type in commands (line-by-line) for it to compute or calculate something. In its simplest form R can therefore be used as a pocket calculator:

2+2
# [1] 4

Question 1. (a) Look up today’s exchange rate of the euro versus the US$ on the Internet. Use R to calculate how many US$ is 15 euro.

# This is the current exchange rate
15 * 1.1767
# [1] 17.6505

(b) Round the number you found to the first decimal using function round (use help or ? to inspect the arguments of the function) and assign the result to an object curr.

curr <- round(15 * 1.1767, 1)
curr
# [1] 17.7

(c) Use mode, str, and summary to inspect the object you created. These functions give a compact display of the type of object and its contents.

mode(curr)
# [1] "numeric"
str(curr)
#  num 17.7
summary(curr)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    17.7    17.7    17.7    17.7    17.7    17.7

Luckily R is more than just a calculator, it is a programming language with most of the elements that every programming language has: statements, objects, types, classes etc.

2 R syntax: data and functions

Question 2. (a) One of the simplest data structures in R is a vector. Use the function seq to generate a vector vec that consists of all numbers from 11 to 30 (see ?seq for documentation).

vec <- seq(11,30)
vec
#  [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

(b) Select the 7th element of the vector.

vec[7]
# [1] 17

(c) Select all elements except the 15th.

# Use '-' to exclude elements
vec[-15]
#  [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 26 27 28 29 30

(d) Select the 2nd and 5th element of the vector.

# Use the concatenate function 'c'
vec[c(2,5)]
# [1] 12 15

(e) Select only the odd valued elements of the vector vec. Do this is in two steps: first create the appropriate vector of indices index using the function seq, then use index to make the selection.

index <- seq(1,length(vec),by=2)
vec[index]
#  [1] 11 13 15 17 19 21 23 25 27 29

Until now you have only used the R console. However, RStudio (Starten - Alle programma’s - R - RStudio) provides a much nicer and richer interface when working with R. Although all exercises can be made within the basic R environment, we highly recommend to use RStudio.

Question 3. (a) What are the windows that RStudio consists of?

Script, Console, Environment, History, Files, Plots, Packages, Help, Viewer.

From now on we advise you to use RStudio. You can use either the Console or the Script window (the upper left window). We recommend you to use the Script window, since this allows you to easily save your code to a file for later usage.

(b) Use the elementary functions /, -, ^ and the functions sum and length to calculate the mean \(\bar{x}=\sum_i x_i/n\) and the standard deviation \(\sqrt{\sum_i(x_i-\bar{x})^2/(n-1)}\) of the vector vec of all numbers from 11 to 30. You can verify your answer by using the built-in R functions mean and sd.

# Don't call the variable 'mean' since the function 'mean' already exists
xmean <- sum(vec)/length(vec)
xmean
# [1] 20.5
mean(vec)
# [1] 20.5
# Calculate the standard deviation in three (baby) steps 
numerator <- sum((vec-xmean)^2)
denominator <- (length(vec)-1)
xstd  <- sqrt(numerator/denominator)
xstd
# [1] 5.91608
sd(vec)
# [1] 5.91608

(c) Once you completed the analysis, have a look at the Environment and History windows. Do you understand their contents?

Question 4. R comes with many predefined datasets. You can type data() to get the whole list. The islands dataset gives the areas of the world’s major landmasses exceeding 10,000 square miles. It can be loaded by typing:

# This could in this case actually be skipped since the package datasets is 
# already loaded
data(islands)

(a) Inspect the object using the functions head, str etc. Also have a look at the help file using help(islands).

(b) How many landmasses in the world exceeding 10,000 square miles are there?

# From ?islands: "Description: The areas in thousands of square miles of 
# the landmasses which exceed 10,000 square miles." So all of them:
length(islands)
# [1] 48

(c) Make a Boolean vector that has the value TRUE for all landmasses exceeding 20,000 square miles.

# Remember that area was given in thousands of square miles
islands.more20 <- (islands > 20)

(d) Select the islands with landmasses exceeding 20,000 square miles.

# Use the Boolean vector islands.more20 to select the ones with value TRUE
islands[islands.more20] 
#          Africa      Antarctica            Asia       Australia          Baffin 
#           11506            5500           16988            2968             184 
#           Banks          Borneo         Britain         Celebes           Celon 
#              23             280              84              73              25 
#            Cuba           Devon       Ellesmere          Europe       Greenland 
#              43              21              82            3745             840 
#      Hispaniola        Hokkaido          Honshu         Iceland         Ireland 
#              30              30              89              40              33 
#            Java           Luzon      Madagascar        Mindanao        Moluccas 
#              49              42             227              36              29 
#      New Guinea New Zealand (N) New Zealand (S)    Newfoundland   North America 
#             306              44              58              43            9390 
#   Novaya Zemlya        Sakhalin   South America         Sumatra        Tasmania 
#              32              29            6795             183              26 
#        Victoria 
#              82

(e) Make a character vector that only contains the names of the islands.

islands.names <- names(islands)

(f) The Moluccas have mistakenly been counted as a single island. The largest island of the Moluccas, Halmahera, only has about 7,000 square miles. Remove Moluccas from the data. Hint: an elegant solution uses the Boolean operator !=.

notMoluccas <- islands.names != "Moluccas"
islands.withoutMoluccas <- islands[notMoluccas]
# Another solution is:
islands.withoutMoluccas <- subset(islands,islands.names!="Moluccas")

We will work with a data set that gives the survival status of passengers on the Titanic. See titanic3info.txt for a description of the variables. Some background information on the data can be found on titanic.html.

Question 5. (a) Download the titanic data set titanic3.dta in STATA format from the course website and import it into R. Give the data set an appropiate name as an R object. E.g., we can call it titanic3 (but feel free to give it another name). Before importing the data, you first have to load the foreign package in which the function read.dta is defined. Moreover, you probably will need to point R to the right folder where you saved the file titanic3.dta. You can do this by changing the so-called working directory, which will be explained later. Using the basic R environment you can do this via the menu (File - Change dir), and similarly for RStudio (Session - Set Working Directory - Choose Directory).

library(foreign)
titanic3 <- read.dta("titanic3.dta", convert.underscore=TRUE)

We take a look at the values in the titanic data set.

(b) First, make the whole data set visible in a spreadsheet like format (how this is done depends on whether base R or RStudio is used).

This can be done using View on the command line. You can also use the menu: in R use Edit - Data editor, in RStudio click on the name of the data set in the Environment window.

(c) Often, it is sufficient to show a few records of the total data set. This can be done via selections on the rows, but another option is to use the functions head and tail. Use these functions to inspect the first 4 and last 4 records (instead of 6 which is the default).

The function dim can be used to find the number of rows (passengers in this case) and columns (variables) in the data.

dim(titanic3)

An alternative is the function str. This function shows information per column (type of variable, first records, levels in case of factor variables) as well as the total number of rows and columns.

str(titanic3)

(d) Issue both commands and study the output.

dim(titanic3)
# [1] 1309   17

(e) Summarize the data set by using the summary function.

summary(titanic3[,c("survived","pclass","home.dest","dob")])
#     survived     pclass     home.dest              dob        
#  Min.   :0.000   1st:323   Length:1309        Min.   :-46647  
#  1st Qu.:0.000   2nd:277   Class :character   1st Qu.:-31672  
#  Median :0.000   3rd:709   Mode  :character   Median :-27654  
#  Mean   :0.382                                Mean   :-28341  
#  3rd Qu.:1.000                                3rd Qu.:-25097  
#  Max.   :1.000                                Max.   :-17488  
#                                               NA's   :263

This gives a summary of all the variables in the data set. We can see that for categorical variables like pclass a frequency distribution is given, whereas for continuous variables like age numerical summaries are given. Still, for some variables we do not automatically get what we would like or expect:

  • The variable survived is dichotomous with values zero and one, which are interpreted as numbers.
  • The categorical variable pclass is represented differently from the categorical variable home.dest.
  • The variable dob, which gives the date of birth, is represented by large negative numbers.

In subsequent exercises, we will shed further light on these anomalies and we will try to repair some of these.

Question 6. We can also summarize specific columns (variables) of a data.frame. There are many ways to summarize a variable, depending on its structure and variability. For continuous variables, the same summary function can be used, but other options are the functions mean, quantile, IQR, sd and var. For categorical summaries, one may use summary and table. Note that missing values are treated differently depending on the function used.

(a) Summarize the age variable in the titanic data set (the age column is selected via titanic3$age). Give the 5%, 25%, 50%, 75% and 95% quantiles and the inter-quartile range. You may have to take a look at the help files for the functions.

# Note that you have to convert percentages into probabilities
quantile(titanic3$age, probs=c(0.05,0.25,0.5,0.75,0.95), na.rm=TRUE)
#  5% 25% 50% 75% 95% 
#   5  21  28  39  57
IQR(titanic3$age, na.rm=TRUE)
# [1] 18

(b) Summarize the survived variable using the table function. Also give a two-by-two table for sex and survival status using the same table function.

table(titanic3$survived)
# 
#   0   1 
# 809 500
table(titanic3$sex, titanic3$survived)
#         
#            0   1
#   female 127 339
#   male   682 161

Question 7 (OPTIONAL).

Instead of the file in STATA format, we also made available part of the titanic data set in tab-delimited format (titanic3select.txt). Download this file from the course website and then import it using the command read.table. Note that this file has been manipulated in Excel and that importing the data is not as straightforward as you would have hoped. You will need to tweak the arguments of read.table and/or make some changes to the file manually.

Running the following line of code:

titanic3.select <- read.table("titanic3select.txt",sep="\t",header=TRUE)

throws an error message:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 16 did not have 18 elements Error during wrapup: cannot open the connection

This error message can in principle be corrected by adding the argument fill=TRUE, which adds blank fields if rows have unequal length. However, running

titanic3.select <- read.table("titanic3select.txt",sep="\t",header=TRUE,fill=TRUE)

throws a warning message:

Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string

It turns out that the data was not correctly imported, which is often easily diagnosed by inspecting the dimensions:

dim(titanic3.select)

Indeed the number of rows is 25 instead of the expected 30. This is caused by a trailing ’ at row 25 for variable home_dest. This can be corrected by changing the default set of quoting characters:

titanic3.select <- read.table("titanic3select.txt", sep="\t", header=TRUE, fill=TRUE, quote="\"")

A closer inspection of the data shows that the last column only contains NAs caused by trailing spaces. On the course website there is a corrected version of the file where the last column and the trailing ’ at row 25 were deleted.

titanic3.select <- read.table("titanic3select_corrected.txt",sep="\t",header=TRUE)

Question 8. We have a further look at the output from the summary function, which was not always what we would like to see. First, give the commands (sapply will be explained later):

sapply(titanic3, mode)
#      pclass    survived        name         sex         age       sibsp 
#   "numeric"   "numeric" "character"   "numeric"   "numeric"   "numeric" 
#       parch      ticket        fare       cabin    embarked        boat 
#   "numeric" "character"   "numeric"   "numeric"   "numeric"   "numeric" 
#        body   home.dest         dob      family      agecat 
#   "numeric" "character"   "numeric"   "numeric"   "numeric"
sapply(titanic3, is.factor)
#    pclass  survived      name       sex       age     sibsp     parch    ticket 
#      TRUE     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE     FALSE 
#      fare     cabin  embarked      boat      body home.dest       dob    family 
#     FALSE      TRUE      TRUE      TRUE     FALSE     FALSE     FALSE      TRUE 
#    agecat 
#      TRUE

We see that survived has mode “numeric” and is not a factor (is.factor(titanic3$survived) gives FALSE). It is interpreted in the same way as the truly continuous variable age.

(a) Add a variable status to the titanic data set, which gives the survival status of the passengers as a factor variable with labels “no” and “yes” describing whether individuals survived. Make the value “no” the first level (see titanic3info.txt for the reason).

# titanic3info.txt: Survival (0 = No; 1 = Yes)
titanic3$status <- factor(titanic3$survived, labels=c("no","yes"))

(b) Variable pclass has mode “numeric” and is a factor, whereas home.dest has mode “character” and is not a factor. Give the commands

as.numeric(head(titanic3$pclass))
as.numeric(head(titanic3$home.dest))

and explain the difference in output.

as.numeric(head(titanic3$pclass))
# [1] 1 1 1 1 1 1
as.numeric(head(titanic3$home.dest))
# [1] NA NA NA NA NA NA

We do not change the variables in this dataset that have mode “character”. They are variables that have many different values, and there is no reason to convert them into factors.

Question 9. (a) Take a look at the name (name), and the home town/destination (home.dest) of all passengers who were older than 70 years. Use the appropriate selection functions.

# Note that the function 'subset' also leaves out the passsengers for which the
# variable 'age' is missing 
subset(titanic3,age>70)[,c("name","home.dest")]
#                                 name               home.dest
# 10           Artagaveytia, Mr. Ramon     Montevideo, Uruguay
# 15   Barkworth, Mr. Algernon Henry W           Hessle, Yorks
# 62    Cavendish, Mrs. Tyrell William Little Onn Hall, Staffs
# 136        Goldschmidt, Mr. George B            New York, NY
# 728             Connors, Mr. Patrick                        
# 1236             Svensson, Mr. Johan
# Another solution is
subset(titanic3,age>70,select=c(name,home.dest))
#                                 name               home.dest
# 10           Artagaveytia, Mr. Ramon     Montevideo, Uruguay
# 15   Barkworth, Mr. Algernon Henry W           Hessle, Yorks
# 62    Cavendish, Mrs. Tyrell William Little Onn Hall, Staffs
# 136        Goldschmidt, Mr. George B            New York, NY
# 728             Connors, Mr. Patrick                        
# 1236             Svensson, Mr. Johan
# Yet another solution is
titanic3[titanic3$age>70 & !is.na(titanic3$age),c("name","home.dest")]
#                                 name               home.dest
# 10           Artagaveytia, Mr. Ramon     Montevideo, Uruguay
# 15   Barkworth, Mr. Algernon Henry W           Hessle, Yorks
# 62    Cavendish, Mrs. Tyrell William Little Onn Hall, Staffs
# 136        Goldschmidt, Mr. George B            New York, NY
# 728             Connors, Mr. Patrick                        
# 1236             Svensson, Mr. Johan

(b) There is one person from Uruguay in this group. Select the single record from that person. Did this person travel with relatives?

subset(titanic3, name=="Artagaveytia, Mr. Ramon")
#    pclass survived                    name  sex age sibsp parch   ticket
# 10    1st        0 Artagaveytia, Mr. Ramon male  71     0     0 PC 17609
#       fare cabin  embarked boat body           home.dest       dob family
# 10 49.5042       Cherbourg        22 Montevideo, Uruguay -43359.75     no
#     agecat status
# 10 (40,80]     no
# No he didn't travel with relatives, since the variable 'family' is 'no'

(c) Make a table of survivor status by sex, but only for the first class passengers. Use the xtabs function.

xtabs(~sex+status, data=titanic3, subset=(pclass=="1st"))
#         status
# sex       no yes
#   female   5 139
#   male   118  61

3 Graphics

3.1 Basic graphics

The graphics subsystem of R offers a very flexible toolbox for high-quality graphing. There are typically three steps to producing useful graphics:

  • Creating the basic plot
  • Enhancing the plot with labels, legends, colors etc.
  • Exporting the plot from R for use elsewhere

In the graphics model that R uses, a figure region consists of a central plotting region surrounded by margins. The basic graphics command is plot. The following piece of code illustrates the most common options:

plot(3,3,main="Main title",sub="subtitle",xlab="x-label",ylab="y-label")
text(3.5,3.5,"text at (3.5,3.5)")
abline(h=3.5,v=3.5)
for (side in 1:4) mtext(-1:4,side=side,at=2.3,line=-1:4)
mtext(paste("side",1:4),side=1:4,line=-1,font=2)

In this figure one can see that

  • Sides are labelled clockwise starting with x-axis
  • Coordinates in the margins are specified in lines of text
  • Default margins are not all wide enough to hold all numbers -1,…,4

You might want to use the help function to investigate some of the other functions and options.

Plots can be further finetuned with the par function. For instance, the default margin sizes can be changed using par. The default settings are

par("mar")
# [1] 5.1 4.1 4.1 2.1

This explains why only side 1 in the figure had a wide enough margin. This can be remedied by setting

par(mar=c(5,5,5,5))

before plotting the figure.

Question 10. (a) Try making this figure yourself by executing the code shown above.

(b) Save the figure you just made to a file. For this you have to know that R sends graphics to a device. The default device on most operating systems is the screen, for example the “Plots” window in RStudio. Saving a figure, therefore, requires changing R’s current device. See help(device) for the options. Save the figure you just made to a png and a pdf file. Don’t forget to close the device afterwards.

png(file="figure.png")
# Add code here to plot the figure (including par(mar=c(5,5,5,5)))
dev.off()
pdf(file="figure.pdf")
# Add code here to plot the figure (including par(mar=c(5,5,5,5)))
dev.off()

(c) When saving a figure to file, default values for the figure size are used. Save the figure to a pdf file with width and height of 21 inches.

pdf(file="figureLarge.pdf",width=21,height=21)
# Add code here to plot the figure (including par(mar=c(5,5,5,5)))
dev.off()

Question 11. The quakes data set gives location and severity of earthquakes off Fuji. First load the data:

data(quakes)

(a) Make a scatterplot of latitude versus longitude. You should obtain a graph similar to:

plot(quakes$long, quakes$lat)

(b) Use cut to divide the magnitude of quakes into three categories (cutoffs at 4.5 and 5) and use ifelse to divide depth into two categories (cutoff at 400). Hint: have a careful look at the (admittedly complicated) help page of cut, in particular the arguments breaks and include.lowest.

mag.cat <- with(quakes,cut(mag, breaks = c(4, 4.5, 5, 7),
                           include.lowest = TRUE ))
# More generic using the functions 'min' and 'max' that return the 
# minimum and maximum, respectively, of a vector
mag.cat <- with(quakes,cut(mag, breaks = c(min(mag), 4.5, 5, max(mag)),
                           include.lowest = TRUE ))
levels(mag.cat) <- c("low", "medium", "high")
depth.cat <- factor(ifelse(quakes$depth>400, "deep", "shallow"))

(c) Redraw the plot, with symbols by magnitude and colors by depth. You should obtain a graph similar to:

# Note that for 'col' the factor 'depth.cat' is interpreted as numeric (deep=1 and 
# shallow=2) and that the first two colours of palette() are used. Strangely enough 
# for 'pch' this is not the case and the factor ''mag.cat' has to be converted to a 
# numeric vector using the function 'as.numeric'.
plot(quakes$long, quakes$lat, pch=as.numeric(mag.cat), col=depth.cat)

(d) The magnitude of the earthquakes is given in Richter scale. Calculate the energy released in each earthquake as \(10^{(3/2)}\) to the power of the Richter measurement.

energy <- (10^(3/2))^quakes$mag

(e) The cex argument of plot can be used to scale the plot symbols. We will scale the plot symbols so that the surface of the plot symbol is proportional to the released energy. Calculate plot symbol size as the square root of the energy divided by the median square root of the energy (to get median symbol size 1).

symbolsize <- sqrt(energy)/median(sqrt(energy))

(f) Plot the magnitude of earthquakes again, but with plot symbols sized by energy. Keep the coloring by depth. You should obtain a graph similar to: