Data preparation, processing and interpretation with R

6.2 Combining two files with rows in common

cbind relies on the datasets being of the same size and in the same order. But what if that is not the case, and not all values are recorded for all samples? Well, if there is a common identifier between the datasets, we can use that to merge two dataframes. As an example, we will try with the built-in R datasets “beaver1” and “beaver2”, which give body temperature plots of two beavers over a period of about 18 hours.

For a full list and descriptions of R’s built in example datasets, use the data() command

head(beaver1)

##   day time  temp activ
## 1 346  840 36.33     0
## 2 346  850 36.34     0
## 3 346  900 36.35     0
## 4 346  910 36.42     0
## 5 346  920 36.55     0
## 6 346  930 36.69     0

For this example, we will merge the tables on the ‘time’ column, so we can start to look how beavers’ body temperature fluctuates through the day

dim(beaver1)

## [1] 114   4

dim(beaver2)

## [1] 100   4

beaver.temp <- merge(beaver1, beaver2, by = "time", all = TRUE, sort = FALSE)
dim(beaver.temp)

## [1] 115   7

head(beaver.temp)

##   time day.x temp.x activ.x day.y temp.y activ.y
## 1  930   346  36.69       0   307  36.58       0
## 2  940   346  36.71       0   307  36.73       0
## 3  950   346  36.75       0   307  36.93       0
## 4 1000   346  36.81       0   307  37.15       0
## 5 1010   346  36.88       0   307  37.23       0
## 6 1020   346  36.89       0   307  37.24       0

What’s happening here?

Function/Argument	Description
`by`	is the column to match the two datasets with. You can use column names, numbers or (to merge on rownames) `row.names`. Note, if you sort by `row.names`, then an extra column will be created in the output object containing the rownames information, while the rownames themselves of this new object will revert to 1, 2, 3…
`all = TRUE`	means that a row will be created for every time point in either of the two samples. If one sample is missing data at that point, then values of “NA” will be entered in the output
`sort = FALSE`	by default, merge sorts the output object in the order of the merge column. In this dataset, the timecourse starts in mid-morning, and continues until after midnight the next day, so sorting is not appropriate in this case.

Unfortunately, even turning off sort doesn’t do what we want in this case. Rather than leaving the rows in the original order, they are reordered but in a somewhat unpredicatable way. There are functions in non-core R packages (like join in the plyr package that address this, but for the moment, a quick and dirty way is to add 2400 to all times after midnight and then sort the data by the time column:

beaver.temp[beaver.temp[, 'time'] < 800, 'time'] <- 
      beaver.temp[beaver.temp[, 'time'] < 800, 'time'] + 2400
beaver.temp <- beaver.temp[order(beaver.temp[,'time']),]

The golub_cbind.RData file you loaded earlier contains two, partially overlapping, datasets from the all and aml samples (all.subset and aml.subset). Use merge() to join these two dataframes using rownames as identifiers
For the dataframe generated by the merge, relabel the rownames correctly, and remove the now redundant “Row.names” column from the dataframe