2.6 Data structures

We do not typically expect to work with single values during an -omics analysis. To make the best use of R we want to work with hundreds, thousands or even millions of data points. We arrange these in one of a number of different data structures: vectors, lists, factors, matrices and data frames.

2.6.1 Vectors

A vector is a group of components of the same type. Vectors are created using the c() function:

data <- c(2,3,34,2,43,234,2,342,43,423)
data
length(data)
##  [1]   2   3  34   2  43 234   2 342  43 423
## [1] 10

2.6.1.1 Indexing

Elements in vectors are accessed directly by their position; this is referred to as indexing. The first element is at index 1.

You can access positions consecutively or by jumping around:

| Operator | Description |
|---|---|
| vector <- c() | the c() function (for combine) joins a number of single values together into a vector |
| vector[i] | returns the element at position \(i\) of a vector; positions start from 1 |
| vector[start:end] | slicing: gets the elements from the \(start\) position to the \(end\) position |
| vector[c(3,5,10)] | retrieves positions 3, 5 and 10 of the vector |
| vector[-i] | returns all elements except position \(i\) |
| length(vector) | returns the number of elements in the vector |

data[3:7]
## [1]  34   2  43 234   2
data[c(3,5,9)]
## [1] 34 43 43

You can also specify which elements not to return by using the - sign in front of the index.

data[-2]
## [1]   2  34   2  43 234   2 342  43 423
data[-c(2,4,8)]
## [1]   2  34  43 234   2  43 423

Note that the above does not delete the elements from the original data variable; it simply leaves them out of what is printed to the screen. If you want to keep a copy of the vector without the specified positions, you need to assign it to a new variable:

new.data <- data[-c(2,4,8)]
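
As a minimal sketch of the same point, using a made-up vector (the names scores and trimmed are illustrative and not part of the exercise data), negative indexing returns a new vector and leaves the original untouched unless the result is assigned:

```r
# Hypothetical example vector, not part of the exercise data
scores <- c(10, 20, 30, 40, 50)

scores[-c(2, 4)]             # prints 10 30 50; scores itself is unchanged
length(scores)               # still 5

trimmed <- scores[-c(2, 4)]  # assign the result to keep it
length(trimmed)              # 3
```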

2.6.1.2 Mixed vectors

What happens if we try to mix data types in a vector? See the following worked example:

mixedVector <- c(shape = "rectangle", 
                 width = as.integer(42),
                 length= 3.25)
mixedVector
##       shape       width      length 
## "rectangle"        "42"      "3.25"

Above, the variable \(mixedVector\) is given three elements of different data types; however, a vector must consist of a single data type. All elements are therefore automatically cast to the most flexible data type present, which here is character.

str() is a function that describes the structure of a variable:

str(mixedVector)
##  Named chr [1:3] "rectangle" "42" "3.25"
##  - attr(*, "names")= chr [1:3] "shape" "width" "length"

Furthermore, in this example, each element has an associated name. We can access the element using its name instead of the position:

mixedVector['length']
## length 
## "3.25"
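
The automatic casting seen in this section follows a fixed order in R: logical, then integer, then numeric, then character. The sketch below uses made-up values (the variable names are illustrative) to show which type c() settles on for different mixtures:

```r
# Each vector mixes two types; c() converts everything to the more flexible one
logicalAndInteger <- c(TRUE, 2L)
integerAndNumeric <- c(2L, 3.25)
numericAndText    <- c(3.25, "shape")

class(logicalAndInteger)  # "integer"
class(integerAndNumeric)  # "numeric"
class(numericAndText)     # "character"
```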

2.6.1.3 Combining vector variables

c() can be used to combine vectors, not just individual values:

c(data, mixedVector)
##                                                                         
##         "2"         "3"        "34"         "2"        "43"       "234" 
##                                                       shape       width 
##         "2"       "342"        "43"       "423" "rectangle"        "42" 
##      length 
##      "3.25"

Using vectors

  1. Enter the chunks above into your RStudio interface and review the output.

  2. After executing the last code block, check on the values that are in variables data and mixedVector again.

    1. How many elements does each variable hold? _______
    2. Do any of them hold 13 elements? ______
    3. What do you need to do to retain the combined vector with 13 elements?
  3. Access the mixedVector element by name using mixedVector['shape']. What is the output? _______

Challenge, continue if you are ahead

  1. Now enter the next chunk of code below and review the output.

    1. Can you determine the difference between mixedVector[2] and mixedVector[[2]]?
  2. What would the equivalent command to mixedVector[[2]] be, if you were accessing the vector by name?

  3. Does the following command work: mixedVector['length']*mixedVector['width']? Why, or why not?

Access some elements of mixedVector by position or name:

mixedVector[2]
mixedVector[[2]]
mixedVector['shape']
names(mixedVector)

2.6.2 Lists

Lists are similar to vectors in that they are one-dimensional structures for storing data and, like vectors, can be accessed by position or name. They differ from vectors in that they can contain multiple different data types. This makes them more flexible for storing data but more limited in the analyses that can be performed on them. Lists are created using the list() command, which has a similar format to c().

mixedList <- list(shape="rectangle", width=as.integer(42), length=3.25)
str(mixedList)
## List of 3
##  $ shape : chr "rectangle"
##  $ width : int 42
##  $ length: num 3.25

Checking for membership

%in% is a way to test the membership of a single element in a list or vector of items. For example, try typing in the command: "rectangle" %in% mixedList
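
A short sketch of %in% on a plain character vector follows. The basket vector and the result names are made up for illustration. Note that the left-hand side can also hold several values, in which case one TRUE or FALSE is returned per value:

```r
# Hypothetical vector of items to test against
basket <- c("apple", "pear", "plum")

hasPear <- "pear" %in% basket            # TRUE
hasKiwi <- "kiwi" %in% basket            # FALSE
several <- c("plum", "kiwi") %in% basket # TRUE FALSE
```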

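Like vectors, elements of a list can be accessed by their position or their name. Named elements can also be accessed using the $ notation. In the lines below, mixedList$shape and mixedList[['shape']] are equivalent and return the value itself, whereas mixedList['shape'] returns a list of length one that contains the value.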

mixedList$shape
## [1] "rectangle"
mixedList['shape']
## $shape
## [1] "rectangle"
mixedList[['shape']]
## [1] "rectangle"

The $ notation is available for lists and data frames (Section 2.6.6), but not for vectors.
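
The difference between single and double brackets matters as soon as you want to compute with a list element. A minimal sketch, using a made-up list called shapeInfo:

```r
# Hypothetical list for illustration
shapeInfo <- list(shape = "rectangle", width = 42L)

class(shapeInfo["width"])    # "list": single brackets return a smaller list
class(shapeInfo[["width"]])  # "integer": double brackets return the value

# $ and [[ ]] give the same result for a named element
identical(shapeInfo$width, shapeInfo[["width"]])  # TRUE

doubled <- shapeInfo[["width"]] * 2  # arithmetic works on the value: 84
```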

2.6.3 Changing Vectors and Lists

Lists and vectors become much more useful when we can add or remove elements, or change the value of existing elements.

# Add items to list
mixedList <- append(mixedList, c(units='mm'))
mixedList <- append(mixedList, c(area=mixedList$length * mixedList$width))

# Display the names of the list elements
names(mixedList)
## [1] "shape"  "width"  "length" "units"  "area"
# Display just the second element of the list
mixedList[2]
## $width
## [1] 42
# Change the value of the 'units' item
mixedList$units <- 'cm'

# Remove the 'area' item from the list
mixedList$area <- NULL
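
The same kinds of change can be made to a vector. The sketch below uses a made-up numeric vector called readings. Since a vector has no $ notation, elements are changed by position and removed with negative indexing:

```r
# Hypothetical vector for illustration
readings <- c(5, 8, 13)

readings[2] <- 9                  # change the value of the second element
readings <- append(readings, 21)  # add a value to the end
readings <- readings[-1]          # remove the first element

readings                          # 9 13 21
```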

Using Lists

  1. Enter the two mixedList code chunks into RStudio and review the output.

  2. Try and retrieve different elements of the list

    1. Can you get the second and fourth elements? Hint: use the combine c() function when indexing
    2. What about everything BUT the second and fourth?
  3. [Optional] If you are ahead, now try getting a range of values, that is, get the second to the fourth elements. Hint: try using the colon (:) notation to specify a range.

2.6.4 Factors

Factors are vector objects that contain grouping (classification) information about their components. Functionally they are similar to vectors in that they are a collection of objects of a single type. Factors are best applied when there are a limited number of different values. They are often used when categorical values are used in modelling or presenting data, e.g. phenotypic or study class data, such as diabetic, pre-diabetic, healthy.

For the example that we explore in the following exercise, we are going to consider a massive bucket of M&Ms. Assuming that the different colours are randomly mixed with equal probability, we would like to explore a million different sweets.

Using Factors

Use the code below to create a factorised mandm vector containing 1 million M&Ms of various colours.

Note: the output is again grouped together at the end for readability. The table below explains the new functions and what they mean.

Create some colours and randomly select a million samples from this list of colours using the sample() function, with replacement (meaning duplicate colours are allowed).

colours <- c("red", "yellow", "green", "blue", "orange")
mandm   <- sample(colours, 1000000, replace=TRUE)

object.size(mandm)
length(mandm)
## 8000280 bytes
## [1] 1000000
# convert this into a factor datatype
mandm <- as.factor(mandm)
str(mandm)
object.size(mandm)
##  Factor w/ 5 levels "blue","green",..: 1 2 3 2 3 1 5 4 4 2 ...
## 4000688 bytes

table() is a very handy method to return the frequency counts:

table(mandm)
## mandm
##   blue  green orange    red yellow 
## 199777 199673 200338 200862 199350

We can also ask questions like: which positions are 'blue'? This is answered by the which() function.

head(which(mandm == "blue"))
## [1]  1  6 29 32 35 45
length(which(mandm == "blue"))
## [1] 199777

What is happening here?

| Function | Description |
|---|---|
| sample() | samples elements from your data collection. The replace argument defines whether a value that has been sampled is put back, so that it can be selected again. |
| object.size() | reports the amount of memory that an object is using. This can be useful for identifying very large objects that can be cleaned from your workspace. |
| length() | returns the number of elements present in the vector. |
| table() | prepares a tabular summary of values and the number of times that they occur. This can be used to summarise data effectively. |
| head() | shows only the first n elements of the vector. Here we did not specify n, so the default of 6 elements is shown. |
| which() | returns a vector of the positions that meet the condition; in this example, which of the 1 million mandm elements are 'blue'? If no element meets the condition, for example when asking for 'white', it returns an empty vector, represented by integer(0). |
| length(which()) | nests the two functions together. It performs the inner which() function first, then passes the result to the length() function. As the names of the functions suggest, this finds out how many elements in mandm are equal to 'blue'. |
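
The drop in object.size() after conversion comes from how a factor is stored: each element becomes a small integer code that points into a short vector of levels. A minimal sketch with a made-up four-element factor called sweets:

```r
# Hypothetical small factor for illustration
sweets <- factor(c("red", "blue", "red", "green"))

levels(sweets)             # "blue" "green" "red" (sorted by default)
as.integer(sweets)         # 3 1 3 2: the integer code stored for each element

which(sweets == "red")     # 1 3
which(sweets == "white")   # integer(0): no element meets the condition
```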

The real challenge with factors is not creating them or using them but adding novel content to them, at least in terms of new categories. If we find an M&M that is not one of our five original colours, we can't just change the colour of that entry in our vector; we need to explicitly add that colour to the factor's levels first. As a result, factors are better suited to immutable or static content. Other data structures provide simpler mechanisms for data manipulation.

Adding factor values

  1. Follow the code chunk below to try to change the colour of the first M&M to the value “white”. The comments explain what each line is doing. You may find it best entering these commands directly in the console rather than the editor window, so you can see what is happening at each stage.

  2. Add another colour option (e.g. brown) and change one or more of the vector elements to that new colour (e.g. elements from position 100 to 200)

  3. Challenge: add yet another colour ("purple") and this time randomly change 20 elements to the new colour purple.

# First, just try to change the first element to white
mandm[1] <- "white"
## Warning in `[<-.factor`(`*tmp*`, 1, value = "white"): invalid factor level,
## NA generated
# Check the factor values, have we successfully added white?
levels(mandm)

# Now explicitly add white as a factor value
levels(mandm) <- c(levels(mandm), "white")

# Try again to change the colour of the first M&M
mandm[1] <- "white"

# And check the factor values again
levels(mandm)
table(mandm)
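
If the possible categories are known in advance, one alternative is to declare them when the factor is created, using the levels argument of factor(). This is a sketch with made-up names (allowedColours, bag), not a replacement for the exercise above:

```r
# Declare every permitted colour up front, including one not yet observed
allowedColours <- c("red", "blue", "white")
bag <- factor(c("red", "blue", "red"), levels = allowedColours)

bag[2] <- "white"   # accepted without a warning, as "white" is already a level

table(bag)          # unused levels are still counted, here blue with 0
```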

2.6.5 Matrices

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout, effectively a table. Like a vector, a matrix can contain only a single type of data.

It is often easiest to create a matrix from a vector of data. The next exercise shows you how to do this: follow the code below to generate a variety of matrices. R cannot guess the dimensions of the data, so you should specify either the number of rows (nrow) or columns (ncol) that the final matrix should have.

Common matrix operations:

| Operation | Description |
|---|---|
| matrix() | creates a matrix |
| nrow() | returns the number of rows |
| ncol() | returns the number of columns |
| dim() | returns the dimensions |

Creating a matrix

Note: this time the output follows immediately after each command, so that you can compare with what you get.

  1. Enter the code below to generate a variety of matrices. Hint: Use the Help tab to search for matrix to find out more about the function and what the nrow and byrow options do.

  2. Rather than printing the output directly to screen, store the matrix into a variable named testMatrix.

matrix(1:12)
##       [,1]
##  [1,]    1
##  [2,]    2
##  [3,]    3
##  [4,]    4
##  [5,]    5
##  [6,]    6
##  [7,]    7
##  [8,]    8
##  [9,]    9
## [10,]   10
## [11,]   11
## [12,]   12
matrix(1:12, nrow=4)
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
matrix(1:12, nrow=4, byrow=TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12
letter.mat <- matrix(LETTERS[1:12], nrow=4, byrow=TRUE)

Some useful functions

  • LETTERS is a built-in variable in R that holds a character vector of the uppercase letters from A to Z.

  • The colon (:) syntax is a quick way of creating a vector of sequential numbers from start to end e.g. 200:225 will return a vector from 200 to 225.

  • The seq(from,to,by) function is also very handy when you need to create a regular sequence of numbers that jump by a certain interval, e.g. seq(20, 50, 2) will return a vector from 20 to 50, incrementing by 2. You can also do the reverse by entering seq(50, 20, -2).
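
The helpers listed above can be tried out with a few short commands. The variable names in this sketch are illustrative only:

```r
firstLetters <- LETTERS[1:3]    # "A" "B" "C"
shortRange   <- 200:205         # 200 201 202 203 204 205
upByTwo      <- seq(20, 30, 2)  # 20 22 24 26 28 30
downByTwo    <- seq(30, 20, -2) # 30 28 26 24 22 20

length(seq(20, 50, 2))          # 16 values from 20 to 50 in steps of 2
```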

2.6.5.1 Indexing a matrix

To access the elements of a matrix you use the notation matrix[row, col]. Examples as follows:

letter.mat[2,]   # returns the 2nd row
letter.mat[,3]   # returns the 3rd column
letter.mat[2,3]  # returns the element at row=2 and column=3
letter.mat[,-2]  # returns all but the 2nd column
letter.mat[3:4,] # returns rows 3 to 4
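
To check your own results, the sketch below rebuilds the same matrix under a different, illustrative name (letterMat) and notes what each kind of index returns. Selecting a single row or column gives back a plain vector, while selecting several keeps the matrix shape:

```r
# Same content as letter.mat: rows are A B C / D E F / G H I / J K L
letterMat <- matrix(LETTERS[1:12], nrow = 4, byrow = TRUE)

letterMat[2, ]         # "D" "E" "F"
letterMat[, 3]         # "C" "F" "I" "L"
letterMat[2, 3]        # "F"

dim(letterMat[, -2])   # 4 2: still a matrix, with one column fewer
dim(letterMat[3:4, ])  # 2 3: a matrix of the last two rows
```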

2.6.6 Data frame

The data frame is to a matrix what a list is to a vector. Like a matrix, it is a two-dimensional table for storing data and is made up of equal-length rows and columns but, like a list, its columns can contain different data types. Within a single column, however, the data must all be of the same type.

Common operations for data frames:

| Operation | Description |
|---|---|
| data.frame() | creates a data frame |
| nrow() | returns the number of rows |
| ncol() | returns the number of columns |
| dim() | returns the dimensions |
| colnames(x) | returns the column names of a data frame |
| colnames(x) <- c('a','b','c') | sets the column names for a data frame |
| rownames(x) | returns the row names of a data frame |
| rownames(x) <- c('p1','p2','p3') | sets the row names for a data frame |

Follow the code chunk below to start exploring data frames. The comments explain what is happening in each stage. The first step tries creating a data frame using the same format as creating a matrix. This doesn't work as intended, though; see if you can work out what is actually happening here.

# Try creating a data frame using the same terminology as a matrix
data.frame(c(LETTERS[1:12]),nrow=4, byrow=TRUE)

# Now build a data frame from a temporary matrix
as.data.frame(matrix(c(LETTERS[1:12]),nrow=4, byrow=TRUE))
exampleDf <- as.data.frame(matrix(c(LETTERS[1:12]),nrow=4, byrow=TRUE))

# It's often useful to name the columns and rows of a data frame (or matrix)
colnames(exampleDf) <- c("x", "y", "z")
rownames(exampleDf)

# What is the size and structure of our data frame
dim(exampleDf)  # Dimension of the data frame
ncol(exampleDf) # Number of columns
nrow(exampleDf) # Number of rows
str(exampleDf)  # Structure information
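
A data frame can also be built directly by passing one named vector per column to data.frame(). The sketch below uses made-up sample data. The column names and values are purely illustrative, and stringsAsFactors = FALSE is set so that the text column stays as character whichever version of R you are running:

```r
# Hypothetical data: each argument becomes one column of equal length
samples <- data.frame(
  id     = c("s1", "s2", "s3"),
  weight = c(61.5, 72.0, 58.3),
  smoker = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

dim(samples)                           # 3 3
columnTypes <- sapply(samples, class)  # one type per column
columnTypes                            # character, numeric, logical
```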

2.6.6.1 Indexing a data frame

Indexing is the same as for a matrix, using the syntax [row,col]:

# First, the contents of the first column
exampleDf[,1]

# Then the contents of the first row
exampleDf[1,]

# Finally, add a fourth column of a different type to the data frame
exampleDf[,4] <- c(1,2,3,4)
exampleDf
class(exampleDf[,3])
class(exampleDf[,4])

Just like lists, if you have column names in your data frame, you can use the $ notation:

colnames(exampleDf)
## [1] "x"  "y"  "z"  "V4"
exampleDf$x
## [1] A D G J
## Levels: A D G J
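
The Levels line in the output above shows that column x was stored as a factor, which is what data.frame() and as.data.frame() did with text by default before R 4.0.0. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the same command may instead print a plain character vector. The sketch below (variable names are illustrative) sets the argument explicitly to show both behaviours:

```r
# Text column converted to a factor, as in the output shown above
asFactors <- data.frame(x = c("A", "D", "G", "J"), stringsAsFactors = TRUE)

# Text column kept as character, the default from R 4.0.0 onward
asText <- data.frame(x = c("A", "D", "G", "J"), stringsAsFactors = FALSE)

class(asFactors$x)  # "factor"
class(asText$x)     # "character"
```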