6.2 Combining two files with rows in common
cbind
relies on the datasets being of the same size and in the same order. But what if that is not the case, and not all values are recorded for all samples? Well, if there is a common identifier between the datasets, we can use that to merge two dataframes. As an example, we will try with the built-in R datasets “beaver1” and “beaver2”, which give body temperature plots of two beavers over a period of about 18 hours.
For a full list and descriptions of R’s built in example datasets, use the data()
command
head(beaver1)
## day time temp activ
## 1 346 840 36.33 0
## 2 346 850 36.34 0
## 3 346 900 36.35 0
## 4 346 910 36.42 0
## 5 346 920 36.55 0
## 6 346 930 36.69 0
For this example, we will merge the tables on the ‘time’ column, so we can start to look how beavers’ body temperature fluctuates through the day
dim(beaver1)
## [1] 114 4
dim(beaver2)
## [1] 100 4
beaver.temp <- merge(beaver1, beaver2, by = "time", all = TRUE, sort = FALSE)
dim(beaver.temp)
## [1] 115 7
head(beaver.temp)
## time day.x temp.x activ.x day.y temp.y activ.y
## 1 930 346 36.69 0 307 36.58 0
## 2 940 346 36.71 0 307 36.73 0
## 3 950 346 36.75 0 307 36.93 0
## 4 1000 346 36.81 0 307 37.15 0
## 5 1010 346 36.88 0 307 37.23 0
## 6 1020 346 36.89 0 307 37.24 0
What’s happening here?
Function/Argument | Description |
---|---|
by |
is the column to match the two datasets with. You can use column names, numbers or (to merge on rownames) row.names . Note, if you sort by row.names , then an extra column will be created in the output object containing the rownames information, while the rownames themselves of this new object will revert to 1, 2, 3… |
all = TRUE |
means that a row will be created for every time point in either of the two samples. If one sample is missing data at that point, then values of “NA” will be entered in the output |
sort = FALSE |
by default, merge sorts the output object in the order of the merge column. In this dataset, the timecourse starts in mid-morning, and continues until after midnight the next day, so sorting is not appropriate in this case. |
Unfortunately, even turning off sort doesn’t do what we want in this case. Rather than leaving the rows in the original order, they are reordered but in a somewhat unpredicatable way. There are functions in non-core R packages (like join in the plyr
package that address this, but for the moment, a quick and dirty way is to add 2400 to all times after midnight and then sort the data by the time column:
beaver.temp[beaver.temp[, 'time'] < 800, 'time'] <-
beaver.temp[beaver.temp[, 'time'] < 800, 'time'] + 2400
beaver.temp <- beaver.temp[order(beaver.temp[,'time']),]
-
The golub_cbind.RData file you loaded earlier contains two, partially overlapping, datasets from the all and aml samples (all.subset and aml.subset). Use
merge()
to join these two dataframes using rownames as identifiers -
For the dataframe generated by the merge, relabel the rownames correctly, and remove the now redundant “Row.names” column from the dataframe