5.1 Data manipulation
5.1.1 Removing samples
# See what we're starting with
dim(golub.matrix)
## [1] 7129 38
# Identify and remove the samples with the two highest and two lowest means
# Yes, we can do this in one rather complicated command
golub.no.outliers <- golub.matrix[,-(order(colMeans(golub.matrix))[
c(1,2,ncol(golub.matrix),ncol(golub.matrix)-1)])]
# Leaving us with just 34 columns
dim(golub.no.outliers)
## [1] 7129 34
What’s happening here?
Function | Description |
---|---|
colMeans | calculates the mean of each column (you could also use apply) |
order | gives the column numbers in ascending order of mean value |
c(1,2, ncol...) | retrieves the first two and the last two entries from that ordering, i.e. the column numbers with the two lowest and two highest means |
Finally, we take a slice of the golub matrix, excluding those four columns using the - (negative index) notation.
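To make the steps in the table easier to follow, here is the same one-liner unpacked on a small matrix. This is only a sketch: the matrix `toy` and its values are made up for illustration, and the intermediate names (`col.means`, `col.order`, `drop.cols`) are not part of the original exercise.

```r
# Toy matrix standing in for golub.matrix: 3 genes (rows) x 6 samples (columns).
# matrix() fills column by column, so each sample gets one constant value.
toy <- matrix(c(5, 5, 5,
                1, 1, 1,
                9, 9, 9,
                3, 3, 3,
                7, 7, 7,
                4, 4, 4),
              nrow = 3,
              dimnames = list(paste0("g", 1:3), paste0("s", 1:6)))

col.means <- colMeans(toy)                 # one mean per sample
col.order <- order(col.means)              # column numbers, lowest mean first
n <- ncol(toy)
drop.cols <- col.order[c(1, 2, n - 1, n)]  # two lowest and two highest means
toy.trimmed <- toy[, -drop.cols]           # negative index: everything except these

colnames(toy.trimmed)
## [1] "s1" "s6"
```

Splitting the command up this way makes it easy to inspect `drop.cols` before anything is removed.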
# Repeat the process for standard deviation. Base R has no colSD function,
# so we will use apply instead
golub.no.outliers <- golub.no.outliers[,-(order(apply(golub.no.outliers, 2,sd))[
c(1,2,ncol(golub.no.outliers),ncol(golub.no.outliers)-1)])]
While there is a certain satisfaction in building a command that identifies and removes the samples with the two highest and two lowest means in a single step, any solution that reliably produces the same outcome is a suitable answer. There are many ways to approach this exercise. For example, you could create a vector containing the mean value of each column and use the which.min and which.max commands on it to find which column numbers to remove. Because each of these returns a single position, you would apply them twice to remove two samples at each end.
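One possible version of the which.min / which.max approach is sketched below. The helper `drop.extremes` and the matrix `toy` are hypothetical and exist only for this illustration. The helper removes one column at each end per call, so it is called twice.

```r
# Hypothetical helper (not part of the original exercise): drop the column
# with the lowest and the column with the highest value of a per-column statistic.
drop.extremes <- function(m, stat = colMeans) {
  values <- stat(m)
  lowest <- which.min(values)    # position of the smallest value (first one if tied)
  highest <- which.max(values)   # position of the largest value (first one if tied)
  m[, -c(lowest, highest), drop = FALSE]
}

# Made-up data: 3 genes x 6 samples, each sample constant for readability
toy <- matrix(rep(c(5, 1, 9, 3, 7, 4), each = 3), nrow = 3,
              dimnames = list(paste0("g", 1:3), paste0("s", 1:6)))

once <- drop.extremes(toy)    # removes s2 (mean 1) and s3 (mean 9)
twice <- drop.extremes(once)  # removes s4 (mean 3) and s5 (mean 7)

colnames(twice)
## [1] "s1" "s6"
```

For the standard deviation step, the same helper could be called with `stat = function(m) apply(m, 2, sd)`.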
5.1.2 Remove uninformative measurements
## to be safe, we make a copy of the original data before filtering
tmp <- golub.no.outliers
nsamples <- ncol(tmp)
dim(tmp)
## [1] 7129 30
## remove rows (genes) where every sample is <= 100
below.floor <- tmp <= 100
remove <- which(rowSums(below.floor) == nsamples)
tmp <- tmp[-remove,]
dim(tmp)
## [1] 5954 30
## remove rows (genes) where every sample is >= 16,000
above.ceil <- tmp >= 16000
remove <- which(rowSums(above.ceil) == nsamples)
tmp <- tmp[-remove,]
dim(tmp)
## [1] 5939 30
## remove rows with a fold change (max/min) of 5 or less across samples
ratios <- apply(tmp,1,max)/apply(tmp,1,min)
keep <- which(ratios > 5)
tmp <- tmp[keep,]
dim(tmp)
## [1] 2902 30
golub.filtered <- tmp
rm(tmp)
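The three filters can also be written as logical masks and combined in one step. The sketch below uses a small made-up matrix `toy`, not the golub data. It also shows a pitfall of the `tmp[-remove,]` pattern: if `which()` finds no rows, the negative index is empty and every row is dropped.

```r
# Made-up expression values: 5 genes (rows) x 3 samples (columns)
toy <- rbind(low    = c(20, 50, 100),          # every sample <= 100
             high   = c(16000, 17000, 20000),  # every sample >= 16000
             flat   = c(200, 300, 400),        # max/min = 2
             varied = c(150, 900, 3000),       # max/min = 20
             mixed  = c(90, 500, 18000))       # max/min = 200
nsamples <- ncol(toy)

all.low  <- rowSums(toy <= 100) == nsamples    # TRUE for rows entirely at the floor
all.high <- rowSums(toy >= 16000) == nsamples  # TRUE for rows entirely at the ceiling
ratios   <- apply(toy, 1, max) / apply(toy, 1, min)

keep <- !all.low & !all.high & ratios > 5
toy.filtered <- toy[keep, , drop = FALSE]
rownames(toy.filtered)
## [1] "varied" "mixed"

# Pitfall: an empty which() result removes everything, not nothing
none <- which(rowSums(toy < 0) == nsamples)    # integer(0): no row matches
nrow(toy[-none, , drop = FALSE])
## [1] 0
```

Because each test is computed per row, combining the masks selects the same rows as applying the filters one after another.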
5.1.3 Extracting specific measurements
# First, read in the file and convert the contents into a named vector
golub100.gnames.tmp <- read.table(file.path(DATADIR, "golub100.txt"),
stringsAsFactors = FALSE)
golub100.gnames <- golub100.gnames.tmp[,2]
names(golub100.gnames) <- golub100.gnames.tmp[,1]
# From the log normalised matrix, extract the 100 rows of interest
golub.100 <- golub.norm[names(golub100.gnames),]
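Selecting rows by name stops with a "subscript out of bounds" error if any requested name is missing from the matrix, so it can be worth checking first. The sketch below assumes a two-column listing like the one read above (column 1 = row ID, column 2 = gene name). The IDs, gene names and the matrix `toy` are all made up for illustration.

```r
# Hypothetical two-column listing standing in for golub100.txt
listing <- read.table(text = "p2 GENE_B
p4 GENE_D
p9 GENE_X", stringsAsFactors = FALSE)

gnames <- listing[, 2]           # values: gene names
names(gnames) <- listing[, 1]    # names: row IDs used in the matrix

# Made-up matrix with row IDs p1..p4; note that p9 is absent
toy <- matrix(1:12, nrow = 4,
              dimnames = list(paste0("p", 1:4), paste0("s", 1:3)))

wanted <- names(gnames)
missing.ids <- setdiff(wanted, rownames(toy))   # requested IDs with no matching row
found.ids <- intersect(wanted, rownames(toy))   # keeps the order of the listing
toy.sub <- toy[found.ids, , drop = FALSE]

missing.ids
## [1] "p9"
rownames(toy.sub)
## [1] "p2" "p4"
```

If `missing.ids` is empty, `toy[wanted, ]` and `toy[found.ids, ]` give the same result.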