6.1 apply()

In Chapter ??, we saw how easily simple transformations can be performed on vectorised data. In those examples, a single transformation was applied to each value independently.

Consider our chunk of gene expression data in the gene.atlas variable. We can imagine a number of simple transformations that we might wish to perform on it, such as a log2 transformation. There are also many cases where we might wish to perform an analysis by row or by column within the data. While this could be managed with a for loop that iterates over the data, there are simpler ways to do it. One of them is apply().
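As a rough illustration of the difference, the sketch below computes row means once with a for loop and once with apply(). The matrix expr is a small hypothetical stand-in, not the real gene.atlas data, and the gene and sample names are made up.

```r
# Hypothetical stand-in for an expression matrix: 3 genes (rows) x 4 samples (columns)
expr <- matrix(c( 2,  4,   8,  16,
                  1,  1,   1,   1,
                 32, 64, 128, 256),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("s1", "s2", "s3", "s4")))

# An element-wise transformation is already vectorised: no loop or apply() needed
log.expr <- log2(expr)

# Row means with an explicit for loop
row.means.loop <- numeric(nrow(log.expr))
for (i in seq_len(nrow(log.expr))) {
  row.means.loop[i] <- mean(log.expr[i, ])
}
names(row.means.loop) <- rownames(log.expr)

# The same calculation with apply() in a single line
row.means.apply <- apply(log.expr, MARGIN = 1, mean)

# Both approaches give geneA = 2.5, geneB = 0, geneC = 6.5
all.equal(row.means.loop, row.means.apply)
```

The loop needs a pre-allocated result vector and manual handling of the names, while apply() carries the row names over to the result by itself.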

apply() takes at least three arguments:

  • the matrix of input data,
  • the MARGIN, which sets whether the calculation is performed by row (1) or by column (2), and
  • the function to apply to each row or column.

The function can be one that is predefined in R (such as mean or sd in the examples below) or one you have created yourself, as described in the previous section on functions. In the latter case, remember that the function should expect a vector as input (one row or one column at a time) and, in the simplest case, return a single value (e.g. it receives a vector of numbers and returns their mean).
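To make the role of MARGIN concrete, here is a minimal sketch on a tiny matrix whose contents are known in advance. The names m and spread are hypothetical and chosen only for this illustration.

```r
# A small matrix with known contents: 2 rows, 3 columns
m <- matrix(1:6, nrow = 2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

apply(m, MARGIN = 1, sum)   # one value per row
## [1]  9 12
apply(m, MARGIN = 2, sum)   # one value per column
## [1]  3  7 11

# A user-defined function: receives one row or column as a vector
# and returns a single number (largest value minus smallest value)
spread <- function(x) {
  max(x) - min(x)
}
apply(m, MARGIN = 1, spread)
## [1] 4 4
```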

The example below finds the standard deviation (sd) of each row and each column of a randomly generated matrix with 1000 rows and 10 columns. If you run it, you will get different results from those shown here, because the sample() function generates the numbers randomly.

dataset <- matrix(sample(10000:30000, 10000, replace=TRUE), ncol=10)

# apply the standard deviation function on each row (MARGIN=1) of the matrix
rowSD <- apply(dataset, MARGIN=1, sd)
head(rowSD)                   # we only show the first 6 values using head()
## [1] 6154.254 5221.133 4391.697 5450.684 4569.088 4208.252
# apply the sd function on each column (MARGIN=2) of the matrix
colSD <- apply(dataset, MARGIN=2, sd)  
head(colSD)
## [1] 5702.597 5744.731 5767.889 5913.901 5813.725 5798.188
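If you want the random matrix to be the same every time you run the code, one option is to fix the random seed with set.seed() before calling sample(). The sketch below assumes an arbitrary seed of 42 and uses the hypothetical names first.run and second.run; it will not reproduce the numbers printed above, which were generated without a known seed.

```r
set.seed(42)   # 42 is an arbitrary choice; any fixed integer works
first.run <- matrix(sample(10000:30000, 10000, replace = TRUE), ncol = 10)

set.seed(42)   # resetting the seed replays the same random draws
second.run <- matrix(sample(10000:30000, 10000, replace = TRUE), ncol = 10)

dim(first.run)
## [1] 1000   10
identical(first.run, second.run)
## [1] TRUE
identical(apply(first.run, MARGIN = 1, sd),
          apply(second.run, MARGIN = 1, sd))
## [1] TRUE
```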

You can also use other functions such as range() to find the minimum and maximum of each row/column of a matrix:

apply(dataset, MARGIN=2, range)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## [1,] 10002 10031 10025 10013 10002 10012 10002 10003 10032 10027
## [2,] 29885 29967 29987 29945 29990 29980 29995 29975 29996 29902
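When the function returns more than one value, as range() does, apply() collects the results into a matrix with one column per row or column processed. The sketch below, on a small hypothetical matrix m, shows the resulting shapes; note that with MARGIN=1 the results still arrive as columns, so t() can be used if one row of results per input row is preferred.

```r
m <- matrix(c(5, 1, 9,
              2, 8, 4), nrow = 2, byrow = TRUE)

# By column: a 2 x 3 result, row 1 holds the minima and row 2 the maxima
col.ranges <- apply(m, MARGIN = 2, range)
dim(col.ranges)
## [1] 2 3

# By row: a 2 x 2 result, still with one *column* per row of m
row.ranges <- apply(m, MARGIN = 1, range)
dim(row.ranges)
## [1] 2 2

# Transpose so that each row of the result describes one row of m
t(row.ranges)
```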

Using apply

Try the above examples using apply(), then repeat them on our minTemp dataset.

Hint:

daily.SD <- apply(minTemp, MARGIN=1, sd, na.rm=TRUE)

Remember that our dataset contains missing values, so we need to set the na.rm argument to remove the NA values before the standard deviation is calculated. apply() passes any extra arguments such as this one on to the function it calls.
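The effect of na.rm can be seen on a small example. The matrix temps below is hypothetical toy data, not part of minTemp.

```r
# Hypothetical toy data with one missing value in the first row
temps <- matrix(c(10, 12, NA,
                  11, 13, 15), nrow = 2, byrow = TRUE)

# Without na.rm, any row containing an NA gives NA
without.na.rm <- apply(temps, MARGIN = 1, sd)
without.na.rm

# na.rm = TRUE is handed on to sd(), which drops the NA first
with.na.rm <- apply(temps, MARGIN = 1, sd, na.rm = TRUE)
with.na.rm
```

In the first call the result for row 1 is NA; in the second it is the standard deviation of the two remaining values, 10 and 12.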

For more information about the available parameters, read the help documentation for apply (?apply).

Expected results, showing only the first 6 elements (by row, then by column):

##    Jan-1    Jan-2    Jan-3    Jan-4    Jan-5    Jan-6 
## 2.289750 2.301163 2.392509 2.512555 2.357019 2.487345
##    X1949    X1950    X1951    X1952    X1953    X1954 
## 5.864297 4.804218 5.564952 5.720759 5.796398 4.825505

Optional exercises

Try the following if you are ahead:

  1. Which year had the highest standard deviation? Hint: save the results from the previous exercise to a variable and use which.max().

  2. Which year had the lowest standard deviation?

  3. Which day had the highest standard deviation across the 69 years?

  4. Which day had the lowest standard deviation across the 69 years?
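As a sketch of how which.max() and its counterpart which.min() behave on a named vector, consider the made-up values below. The vector toy.sd and its names are hypothetical and are not results from minTemp.

```r
# Hypothetical named vector standing in for saved apply() results
toy.sd <- c(X2001 = 4.2, X2002 = 6.1, X2003 = 3.7)

which.max(toy.sd)            # position of the largest value, labelled with its name
names(which.max(toy.sd))     # just the name
## [1] "X2002"
names(which.min(toy.sd))
## [1] "X2003"
toy.sd[which.max(toy.sd)]    # the largest value itself, with its name
```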

You can also use your own user-defined function in apply(). The comments in the following example explain each section:

# Create a user-defined function that expects numeric data,
# calculates the mean temperature (ignoring NA values),
# converts it to Fahrenheit and returns that single value
mean.celsius.to.fahrenheit <- function(temp){
  mean.temp <- mean(temp, na.rm=TRUE)
  return((9/5)*mean.temp+32)
}

# Call the conversion function to return the
# mean temperature per day of the year (MARGIN=1 is rows)
head(apply(minTemp, MARGIN=1, mean.celsius.to.fahrenheit))
##    Jan-1    Jan-2    Jan-3    Jan-4    Jan-5    Jan-6 
## 68.96618 69.17794 69.60412 69.42676 69.49294 69.31294
# Call the conversion function to return the
# mean temperature per year (MARGIN=2 is cols)
head(apply(minTemp, MARGIN=2, mean.celsius.to.fahrenheit))
##    X1949    X1950    X1951    X1952    X1953    X1954 
## 53.83009 59.63715 56.04122 59.61377 58.50044 60.48488

apply

  • apply is a crucial method for exploring data within the rows and columns of a data frame or matrix.

  • lapply is similar to apply, but it works on each element of a list or vector (or each column of a data frame) and always returns its results as a list.

  • There is also the dplyr package, which is even more powerful but is beyond the scope of this workshop. See https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
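To illustrate the lapply point above, here is a minimal sketch on a hypothetical list called scores. Unlike a matrix, the elements of a list may have different lengths, and no MARGIN is needed.

```r
# Hypothetical list whose elements have different lengths
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() calls mean() on each element and returns a list of the same length
score.means <- lapply(scores, mean)
is.list(score.means)
## [1] TRUE
score.means$b
## [1] 15

# unlist() flattens the result to a named vector if that is more convenient
unlist(score.means)
```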