5.1 Data manipulation

5.1.1 Removing samples

# See what we're starting with
dim(golub.matrix)
## [1] 7129   38
# Identify and remove the two samples with the highest mean and the two with the lowest mean
# Yes, we can do this in one rather complicated command
golub.no.outliers <- golub.matrix[,-(order(colMeans(golub.matrix))[
                                    c(1,2,ncol(golub.matrix),ncol(golub.matrix)-1)])]

# Leaving us with just 34 columns
dim(golub.no.outliers)
## [1] 7129   34

What’s happening here?

Function          Description
colMeans          calculates the mean of each column (you could also use apply)
order             returns the column numbers in ascending order of mean value
c(1,2, ncol...)   selects the first two and the last two entries of that ordering, i.e. the column numbers with the two lowest and the two highest means

Finally, we take a slice of the golub matrix, excluding those four columns by negative indexing (the - prefix).
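
To see each step in isolation, here is a minimal sketch that unpacks the one-line command on a made-up matrix. The object toy and its values are invented for illustration and are not part of the Golub data; only colMeans, order and negative indexing are taken from the answer above.

```r
# Toy stand-in for golub.matrix: 3 rows (genes) by 6 columns (samples).
# Each column holds one repeated value, so its mean is that value.
toy <- matrix(rep(c(5, 1, 9, 4, 2, 8), each = 3),
              nrow = 3,
              dimnames = list(paste0("g", 1:3), paste0("S", 1:6)))

col.means <- colMeans(toy)      # one mean per column: 5 1 9 4 2 8
col.order <- order(col.means)   # column numbers, lowest mean first: 2 5 4 1 6 3
n <- ncol(toy)

# First two entries = two lowest means, last two = two highest means
extremes <- col.order[c(1, 2, n - 1, n)]

# Negative indexing drops those four columns
toy.no.outliers <- toy[, -extremes]
colnames(toy.no.outliers)
## [1] "S1" "S4"
```

Splitting the command this way gives the same result as the nested form, and the intermediate objects can be inspected if the outcome looks wrong.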

# Repeat the process for standard deviation. There is no colSD function,
# so we will use apply instead
golub.no.outliers <- golub.no.outliers[,-(order(apply(golub.no.outliers, 2,sd))[
                          c(1,2,ncol(golub.no.outliers),ncol(golub.no.outliers)-1)])]

While there is a certain satisfaction in building a command that identifies and removes the samples with the two highest and two lowest means in a single step, any solution which reliably produces the same outcome is a suitable answer. For example, you could create a vector containing the mean value of each column and use the which.min and which.max commands on it to find which column numbers to remove. Each of these returns a single position, so the search has to be repeated to find the second-lowest and second-highest columns.
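
As a rough sketch of the which.min / which.max route, the loop below runs two rounds, each time recording the current lowest and highest column and then dropping them from the vector of means. The matrix toy and the helper names remaining, to.remove and kept are hypothetical, chosen only for this illustration.

```r
# Toy stand-in for golub.matrix (invented values, 6 named sample columns)
toy <- matrix(rep(c(5, 1, 9, 4, 2, 8), each = 3),
              nrow = 3,
              dimnames = list(NULL, paste0("S", 1:6)))

remaining <- colMeans(toy)    # named vector of column means
to.remove <- character(0)     # column names collected for removal

for (i in 1:2) {
  # Position (and name) of the current lowest and highest mean
  hits <- c(which.min(remaining), which.max(remaining))
  to.remove <- c(to.remove, names(hits))
  remaining <- remaining[-hits]   # hits is never empty here, so this is safe
}

to.remove
## [1] "S2" "S3" "S5" "S6"

# Drop the collected columns by name
kept <- toy[, !(colnames(toy) %in% to.remove), drop = FALSE]
colnames(kept)
## [1] "S1" "S4"
```

Working with column names rather than positions avoids having to adjust indices after each round. Note that which.min and which.max return the first match when values are tied.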

5.1.2 Remove uninformative measurements

## To be safe, we make a copy of the original data before filtering
tmp <- golub.no.outliers
nsamples <- ncol(tmp)
dim(tmp)
## [1] 7129   30
## remove rows (genes) where every sample is <= 100
## a logical index is used so that nothing is dropped if no row matches
below.floor <- tmp <= 100
remove <- rowSums(below.floor) == nsamples
tmp <- tmp[!remove,]
dim(tmp)
## [1] 5954   30
## remove rows (genes) where every sample is >= 16,000
## a logical index is used so that nothing is dropped if no row matches
above.ceil <- tmp >= 16000
remove <- rowSums(above.ceil) == nsamples
tmp <- tmp[!remove,]
dim(tmp)
## [1] 5939   30
## keep only rows with more than a 5-fold change (max/min) across samples
ratios <- apply(tmp,1,max)/apply(tmp,1,min)
keep <- which(ratios > 5)
tmp <- tmp[keep,]
dim(tmp)
## [1] 2902   30
golub.filtered <- tmp
rm(tmp)
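
One thing to watch for when removing rows by negative index: if which() finds no matching rows it returns an empty vector, and indexing with the negation of an empty vector selects nothing rather than everything. The sketch below shows the effect on a made-up matrix (toy and its values are invented for illustration) and compares it with a logical index.

```r
# Toy matrix: 3 genes by 2 samples; no gene is >= 16000 in every sample
toy <- matrix(c(50, 200, 20000,
                80, 300,   150),
              nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("S1", "S2")))

above.ceil <- toy >= 16000

# Negative integer index: which() returns integer(0) because nothing matches
remove <- which(rowSums(above.ceil) == ncol(toy))
nrow(toy[-remove, , drop = FALSE])   # every row is lost
## [1] 0

# Logical index: rows are kept unless they match the removal condition
keep <- rowSums(above.ceil) < ncol(toy)
nrow(toy[keep, , drop = FALSE])      # all three rows survive
## [1] 3
```

With the Golub data each filter does match some rows, so either form gives the dimensions reported above; the logical form simply stays correct when a filter matches nothing.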

5.1.3 Extracting specific measurements

# First, read in the file and convert the contents into a named vector
golub100.gnames.tmp <- read.table(file.path(DATADIR, "golub100.txt"),
                                  stringsAsFactors = FALSE)
golub100.gnames <- golub100.gnames.tmp[,2]
names(golub100.gnames) <- golub100.gnames.tmp[,1]

# From the log-normalised matrix, extract the 100 rows of interest
golub.100 <- golub.norm[names(golub100.gnames),]
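
Indexing a matrix by row name stops with a "subscript out of bounds" error if any requested name is missing, so it can help to check the names first. The following is a minimal sketch on invented data: gnames.tmp, gnames, norm and picked are hypothetical stand-ins for the objects above, and it assumes that column 1 of the file holds the row identifiers of the matrix and column 2 the gene names.

```r
# Hypothetical two-column table standing in for the contents of golub100.txt
gnames.tmp <- data.frame(V1 = c("p2", "p4"),
                         V2 = c("GENE_B", "GENE_D"),
                         stringsAsFactors = FALSE)

# Named vector: values are gene names, names are row identifiers
gnames <- gnames.tmp[, 2]
names(gnames) <- gnames.tmp[, 1]

# Toy stand-in for the normalised matrix: 4 rows by 2 samples
norm <- matrix(1:8, nrow = 4,
               dimnames = list(paste0("p", 1:4), c("S1", "S2")))

# Check that every requested identifier exists before subsetting
missing.ids <- setdiff(names(gnames), rownames(norm))
stopifnot(length(missing.ids) == 0)

# Extract the rows of interest, in the order given by the named vector
picked <- norm[names(gnames), , drop = FALSE]
picked
##    S1 S2
## p2  2  6
## p4  4  8

# The named vector doubles as a lookup from row identifier to gene name
gnames[rownames(picked)]
##       p2       p4 
## "GENE_B" "GENE_D"
```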