# This tutorial for R is intended for new users of R.
# Before you start on the tutorial you must install R on your computer - see
# http://www.finse.uio.no/events/international-workshops/introduction-to-estimation
# for instructions. I also recommend the editor TINN-R for editing R scripts.
# You can also open this script in the simple built-in editor i R: Open this
# file in "File > Open script" in R.
# R will ignore all lines that starts with a "#". This is used for writing
# comments in the script. All other lines will be interpreted by R. Run this
# script line by line in R. If you are using the built-in R editor, you can run
# one line at a time by holding down the 'ctrl' key and hitting "r" on your
# keyboard. In TINN-R you can do the same by clicking on the "Send line" button
# or you can set up a "hotkey" for this action in the "R" menu.
##############################
## USING R AS A CALCULATOR: ##
# R is really just an advanced calculator made primarily for statistical
# analyses
# Try the following operations:
2+2
(2+2)*3
3/2
2^3
log(1) # the natural logarithm (ln)
exp(1) # e to the power of
log10(100) # log with base 10
##-- end of section --##
############################
## OBJECTS AND FUNCTIONS: ##
# You can assign the results of operations to "objects" and then use
# these objects in later operations. Try:
a = 2+2
a
a*3
# R will remember the values of such "objects" as long as you want
# In older versions of R, assignments had to be done with "<-" (an "arrow").
# Many people still use this ("=" and "<-" does the same thing):
a <- 3+3
a
a*3
# Note that you now replaced the value of 'a' with a new value
# There are many kinds of 'objects' in R. You can for example make a vector
# containing the body mass of 5 individuals
body.mass = c(50.3, 55.4, 52.9, 59.7, 60.1)
body.mass
# Here c(...) just tells R that the numbers within the parenthesis should be
# stored in a vector. "body.mass" is just a name that you decide. The "." is
# just part of the name and has no special meaning for R.
# R has many "functions". Try:
mean(body.mass) # returns mean body mass
sd(body.mass) # returns the standard deviation of body mass
# Functions in R takes numbers or objects as input arguments, do some operations
# on these, and return the resulting number(s) or objects
# Just as above, you can assign the output of functions to new objects and
# perform new operations on these objects. Try to compute the coefficient of
# variation in the following way:
mean.body.mass = mean(body.mass)
sd.body.mass = sd(body.mass)
cv.body.mass = sd.body.mass/mean.body.mass
cv.body.mass
# This can be done more directly:
cv.body.mass = sd(body.mass)/mean(body.mass)
cv.body.mass
# You can also make your own functions in R. You can for example make a new
# function that calculates the coefficient of variation:
cv = function(x) sd(x)/mean(x)
# The line above tells R that it should create a new function called "cv" that
# takes one argument, x, and compute sd(x)/mean(x). 'x' is just an internal
# object that is not "visible" outside the function. If you have an object
# called "x" in the memory "outside" the function, this will not be overwritten.
# You can now use this function to calculate the coefficient of variation of
# your body.mass vector (or any other vector).
cv(body.mass)
# There are very many "built in" functions in R (you can use "R card" in TINN-R
# to search for functions that do different things). To get help on a function
# you can type "?" followed by the name of the function - for example:
?mean
?lm
# A help window should now pop up. You will see that the function 'mean' can
# take other arguments (that have a default value that is used if you don't
# specify anything else). If you are using TINN-R, a yellow box will pop up as
# you start typing, telling you what arguments the function can take (try typing
# "lm(..." in TINN-R). Note that the name of the arguments doesn't have to be
# given IF they appear in the order they are given in the help file.
##-- end of section --##
##########################
## VECTORS AND MATRICES ##
# To create a vector with the values 0 to 10, write
x = 0:10
x
# To create a vector of the numbers 0,10,20,...,100, write
y = seq(0, 100, by=10)
y
# If you add a single number (a "scalar") to a vector, this number is added to
# every element of the vector
y + 2
# if you add two vectors of the same length to each other, the first element of
# the first vector is added to the first element of the second vector, and so
# on:
x + y
# ...the same with other operations
x * y
y - x
y/x
# If you use a vector in a function that normally just does an operation to a
# scalar (a single number) then this function will be applied on each element
# of the vector
log(x)
# To see the number of elements in a vector use
length(x)
length(y)
# If you add a shorter vector to longer vector, the shorter vector will be
# "resycled"
z = 1:10
z + c(0,1) # The first element is added 0, the second element added 1, third element added 0, etc.
# If the longer vector is not a multiple of the shorter vector, R will issue a
# warning message
z = 1:10
z + c(0,1,2)
# To create matrices you can use the function 'matrix'. For example, to create
# a 2-by-2 matrix with the values 1:4 column wise, use
x = matrix(1:4, 2, 2)
x
# Matrices are treated in much the same way as vectors (for multi-dimensional
# arrays, use 'array' (type ?array))
# For matrix multiplication, use the '%*%' operator
y = c(1,2)
z = x %*% y
z # z[1,1] = x[1,1]*y[1] + x[1,2]*y[2], and z[2,1] = x[2,1]*y[1] + x[2,2]*y[2]
# Note that R treats vectors as either column vectors or row vectors depending
# on the situation. Above y is treated as a column vector, below it is treated
# as a row vector
z = y %*% x
z # z[1,1] = y[1]*x[1,1] + y[2]*x[2,1], and z[1,2] = y[1]*x[1,2] + y[2]*x[2,2]
##-- end of section --##
############################
## WORKING WITH DATA SETS ##
# We can put many variables of equal length together in what is called a
# 'data.frame'. For example, create the data.frame "my.data":
my.data = data.frame(body.length = c(170, 168, 156, 169, 172), body.mass = body.mass)
my.data
# To access the variable "body.length" in the data.frame "my.data" you have to
# first write the name of the data.frame and then the name of the variable
# seperated by a "$" sign:
my.data$body.length
# Note that you now have both an object called "body.mass" and a variable with
# the same name within the object called "my.data". If we change the object
# called "body.mass", this will not change the varaible "body.mass" within
# "my.data". Try it:
body.mass = 100
body.mass
my.data$body.mass
# NB! It is a good idea to not use the same name on a variable within a
# data.frame as the name of an object. In that way you avoid mistakes by using
# the wrong object. See more on objects and search paths below.
# To read in data from for example Excel, the easiest is to save the file as a
# comma dilimited text file (csv extention) and use the 'read.csv' function in
# R (csv-file can also be opened directly in Excel whithout using the import
# wizard).
# I have uploaded a data-set containing measurements of diameter, height and
# volum of 31 trees on http://www.finse.uio.no/events/international-workshops/introduction-to-estimation/data/trees.csv
# To read these data into R and store it in a data.frame called "trees", use
trees = read.csv("http://www.finse.uio.no/events/international-workshops/introduction-to-estimation/data/trees.csv")
trees
# To read a csv-file from your hard drive, write the full path to the file as
# "D:/.../" NB! NB! R uses "/" (forward slash) in the file paths, whereas
# Windows uses "\" (backward slash)
# Get a summary of the data.frame:
summary(trees)
# Look at the 5 first rows only:
trees[1:5,]
# Look at the 5 first rows and 2 first column only:
trees[1:5,1:2]
##-- end of section --##
########################################
## SIMULATING DATA AND FITTING MODELS ##
# You can draw random elements from a vector with the 'sample' function. For
# example the following draws 10 random numbers from 1:100 without replacement
sample(1:100, 10) # try this several times and observe that different numbers are drawn every time
# To draw numbers from a vector with replacements (so you can draw the same
# number more than once), use the 'replace = T' argument ("T" is short for
# "TRUE"). The following draws 10 random numbers from 1 to 5:
sample(1:5, 10, replace=T)
# You can also draw random numbers from a probability distribution. To draw 10
# random numbers from a normal distribution with mean 20 and standard deviation
# 2, use
rnorm(10, mean=20, sd=2)
# The following creates a vector of 10 000 numbers drawn from this distribution
# and show the distribution of this sample in a histogram
x = rnorm(10000, mean=20, sd=2)
hist(x)
# Simulating data in this manner is a great way to teach yourself statistics.
# You can for example use simulations to try out different functions in R or
# new methods you want to learn. For example, to try out the fitting of a simple
# linear model where y is dependent on x and where there is some measurement
# error in the observations of y, you first have to decide on sample size (n),
# the values of x, and the values of the parameters in the model intercept (a),
# slope (b) and standard deviation of the measurement error (sd.e). Here I pick
# some numbers:
n = 100
x = seq(10, 20, length=n) # n numbers at equal intervals between 10 and 20
a = 1
b = 0.1
sd.e = 0.5
# You can then simulate observations (y) from the model y = a + b*x + e, where e
# is a random number drawn from a normal distribution with mean zero and
# standard deviation sd.e
e = rnorm(n, mean=0, sd=sd.e)
y = a + b*x + e
# To plot the simulated data use
plot(y,x)
# You can then use the 'lm' function in R to fit this model to the data you have
# now simulated
lm(y~x) # This tells R that y should be modelled as a linear function of x ('lm' always assume a normally distributed error)
# ...the first number is an estimate of the intercept, a, and the second number
# is an estimate of the slope, b. If you try increasing the sample size, you
# will see that these numbers will become closer to the true values of the
# parameters you have used in the simulation.
# The 'lm' function calculates a lot more useful things than what is shown
# above. To access this, you must first store an object from the 'lm' function
fit = lm(y~x)
# To get a summary of this model object, type
summary(fit)
# You now also get standard errors (measures of uncertainty) and you get an
# estimate of the standard deviation of the measurement error (given as
# "Residual standard error").
# You can also let R calculate the 95% confidence intervals of the parameter
# estimates for you by using the 'confint' function
confint(fit)
# If you increase the sample size you will see that the standard error becomes
# smaller, the confidence intervals become narrower and the parameters
# estimates tend to become closer to the true values of the parameters.
# The model object created by the 'lm' function contains many "sub-objects".
# To see the names of these "sub-objects" type
names(fit)
# There is a description of each of these "sub-ojects" in the help (type '?lm').
# You can access these "sub-obects" by using the "$" sign in the same way as for
# data.frames. For example, the following makes a plot of the residuals versus
# the fitted (predicted) values
plot(fit$residuals, fit$fitted.values)
# The 'summary' function also creates an object with many sub-objects. Try
names(summary(fit))
# Simulations of the type we did above can be repeated many times in a "loop".
# The script below repeats the simulations and model fitting results 1000 times
# and stors the parameter estimates every time in a data.frame
n.sim = 1000 # number of simulations
estimates = data.frame(a=rep(NA,n.sim), b=rep(NA,n.sim), sd.e=rep(NA,n.sim)) # creating an empty data.frame
n = 100
x = seq(10, 20, length=n) # n numbers at equal intervals between 10 and 20
a = 1
b = 0.1
sd.e = 0.5
for(i in 1:n.sim){ # start of loop
e = rnorm(n, mean=0, sd=sd.e)
y = a + b*x + e
fit = lm(y~x)
estimates[i,1:2] = fit$coefficients
estimates[i,"sd.e"] = summary(fit)$sigma
} # end of loop
summary(estimates)
hist(estimates$a) # histogram of the estimates of the intercept
abline(v=a, col="red") # superimposing the true value as a red line
hist(estimates$b) # histogram of the estimates of the slope
abline(v=b, col="red") # superimposing the true value as a red line
hist(estimates$sd.e) # histogram of the estimates of the residual standard deviation
abline(v=sd.e, col="red") # superimposing the true value as a red line
# Simulations of this type can be used to investigate the influence of sample
# size and study design on parameter estimates.
# If the parameter values (values of a, b and sd.e) were obtained from fitting
# the model to a real data set, then the procedure above would have been an
# example of "Parametric Bootstrapping". Other methods that make use of
# simulations include Marcov-Chain-Monte-Carlo (MCMC) simulations to generate
# the posterior parameter distributions in Bayesian statistics.
######################################
## MORE ON OBJECTS AND SEARCH PATHS ##
# In text books, data.frames are often "attached" so that you can
# access the variables within it without using the "$" sign. For example, if
# you write
attach(my.data)
# you can access my.data$body.length by just writing
body.length
# However, "body.mass" will still refer to the object called only "body.mass",
# and not the variable "body.mass" in the data.frame you have attached
body.mass
my.data
# Objects in R are organized in a hierarchal "search path" that you can see by
# typing
search()
# At the first level you always have the ".GlobalEnv". When you refer to an
# object with a given name, R will first look if it finds the object here, and
# use this if it does. If there is no object with the given name here, R will
# look for an object in the second environment in the list, then the third,
# fourth, and so on. In the second position in the search path you have the
# data.frame that you just attached. The subsequent positions in the search path
# are default packages with objects and functions that come with R (there are
# hundreds of such packages available in R that you can download and install
# when you need them).
# To see the objects you have in the ".GlobalEnv", type
objects()
# or
ls()
# To see the objects in the second position, type
objects(2)
# and so on.
# WARNING! If you "attach" data.frames you can easly make mistakes by using an
# object in the ".GlobalEnv" when you intend to use a variable in a data.frame
# that you have attached. For this reason, it is safest to avoid attaching
# data.frames to the search path and instead use the full reference to the
# variable by using the $ structure.
# If you have attached a data.frame you should "detatch" it when you are done
# with it:
detach("my.data")
search() # check that my.data is removed from the search path
# It is also a good habit to start all analysis scripts that you write by
# removing all objects in the ".GlobalEnv". In that way you avoid using objects
# from a previous analysis by misstake. To remove all objects, type
rm(list=ls(all=TRUE)) # Look up in the help to see what this means. There is also a button for doing this in TINN-R.
##-- end of section --##