Data preparation, processing and interpretation with R

5.1 Data manipulation

5.1.2 Removing samples

# See what we're starting with
dim(golub.matrix)

## [1] 7129   38

# Identify and remove the two samples with the lowest mean and the two with
# the highest mean. Yes, we can do this in one, rather complicated command
golub.no.outliers <- golub.matrix[,-(order(colMeans(golub.matrix))[
                                    c(1,2,ncol(golub.matrix),ncol(golub.matrix)-1)])]

# Leaving us with just 34 columns
dim(golub.no.outliers)

## [1] 7129   34

What’s happening here?

Function	Description
`colMeans`	calculates the mean of each column (you could also use `apply`)
`order`	returns the column numbers in ascending order of mean value
`c(1,2, ncol...)`	selects the first two and the last two entries of that ordering, i.e. the column numbers with the two lowest and two highest means

Finally, we take a slice of the golub matrix, excluding those four columns by using a negative index (the `-` prefix).
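To make the nesting easier to follow, the same steps can be written out one at a time. This is only a sketch on a small made-up matrix (`toy` and its values are hypothetical stand-ins for `golub.matrix`), not the exercise data:

```r
# Toy stand-in for golub.matrix (hypothetical data): 3 genes x 8 samples,
# built so that the column means are 5, 1, 9, 3, 7, 2, 8, 4
toy <- sapply(c(5, 1, 9, 3, 7, 2, 8, 4), function(m) m + c(-1, 0, 1))
colnames(toy) <- paste0("s", 1:8)

col.means <- colMeans(toy)               # one mean per sample (column)
by.mean   <- order(col.means)            # column numbers, lowest mean first
n         <- ncol(toy)
extremes  <- by.mean[c(1, 2, n - 1, n)]  # two lowest and two highest means

toy.no.outliers <- toy[, -extremes]      # negative index drops those columns
colnames(toy.no.outliers)                # s1, s4, s5 and s8 remain
```

Naming the intermediate results costs a few extra lines but makes each stage easy to inspect.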

# Repeat the process for standard deviation. There is no colSD function,
# so we will use apply instead
golub.no.outliers <- golub.no.outliers[,-(order(apply(golub.no.outliers, 2,sd))[
                          c(1,2,ncol(golub.no.outliers),ncol(golub.no.outliers)-1)])]
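Because the mean and standard deviation steps differ only in the summary function, they could be wrapped in a small helper. The function name `drop.extreme.cols` and the toy data below are hypothetical and purely illustrative:

```r
# Hypothetical helper: drop the n.drop lowest and n.drop highest columns,
# ranked by the summary function `stat`. Assumes n.drop >= 1 and that the
# matrix has more than 2 * n.drop columns.
drop.extreme.cols <- function(m, stat, n.drop = 2) {
  scores  <- apply(m, 2, stat)        # one score per column
  ranked  <- order(scores)            # column numbers, lowest score first
  to.drop <- c(head(ranked, n.drop), tail(ranked, n.drop))
  m[, -to.drop, drop = FALSE]
}

# Toy data (hypothetical): 4 genes x 10 samples
set.seed(42)
toy <- matrix(rnorm(40, mean = 100, sd = 20), nrow = 4)

step1 <- drop.extreme.cols(toy, mean)   # 10 -> 6 columns
step2 <- drop.extreme.cols(step1, sd)   # 6 -> 2 columns
dim(step2)
```

The `drop = FALSE` argument keeps the result a matrix even if only one column survives.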

While there is a certain satisfaction in building a command that identifies and removes the samples with the two highest and two lowest means in a single step, any solution that reliably produces the same outcome is a suitable answer, and there are many ways to approach this exercise. For example, you could create a vector containing the mean value of each column and use the which.min and which.max commands on it to find which column numbers to remove. Each of these returns only a single position, so you would apply them twice to find the two lowest and two highest columns.
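One possible version of the which.min and which.max approach is sketched here on a made-up matrix (`toy` is hypothetical). After the first pass, the columns already found are masked with NA so that the second pass finds the next most extreme ones:

```r
# Toy stand-in for golub.matrix (hypothetical data): 3 genes x 8 samples,
# built so that the column means are 5, 1, 9, 3, 7, 2, 8, 4
toy <- sapply(c(5, 1, 9, 3, 7, 2, 8, 4), function(m) m + c(-1, 0, 1))
colnames(toy) <- paste0("s", 1:8)

col.means <- colMeans(toy)

# First pass: the single lowest and single highest mean
to.drop <- c(which.min(col.means), which.max(col.means))

# Second pass: mask those two columns (which.min and which.max skip NA values)
remaining <- col.means
remaining[to.drop] <- NA
to.drop <- c(to.drop, which.min(remaining), which.max(remaining))

toy.no.outliers <- toy[, -to.drop]
colnames(toy.no.outliers)                # s1, s4, s5 and s8 remain
```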

5.1.2 Remove uninformative measurements

## to be safe, we make a copy of the original data before filtering
tmp <- golub.no.outliers
nsamples <- ncol(tmp)
dim(tmp)

## [1] 7129   30

## remove rows (genes) where all samples are <= 100
below.floor <- tmp <= 100
remove <- which(rowSums(below.floor) == nsamples)
tmp <- tmp[-remove,]
dim(tmp)

## [1] 5954   30

## remove rows (genes) where all samples are >= 16,000
above.ceil <- tmp >= 16000
remove <- which(rowSums(above.ceil) == nsamples)
tmp <- tmp[-remove,]
dim(tmp)

## [1] 5939   30
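The `tmp[-remove,]` idiom works here because each filter matches at least one row. If `which()` finds nothing, `-remove` is an empty index and R returns zero rows rather than all of them. The following sketch on made-up values (`toy` is hypothetical) shows the pitfall and a logical-index alternative that avoids it:

```r
# Toy matrix (hypothetical data): 3 genes x 2 samples, none below the floor
toy <- matrix(c(200, 300, 400, 250, 350, 450), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# No row has all samples <= 100, so which() returns an empty vector
remove <- which(rowSums(toy <= 100) == ncol(toy))
nrow(toy[-remove, , drop = FALSE])   # 0 rows: every gene has been lost

# Logical indexing keeps every row when nothing matches
all.below <- rowSums(toy <= 100) == ncol(toy)
kept <- toy[!all.below, , drop = FALSE]
nrow(kept)                           # 3 rows: nothing was removed
```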

## remove rows with a fold change of 5 or less across samples
ratios <- apply(tmp,1,max)/apply(tmp,1,min)
keep <- which(ratios > 5)
tmp <- tmp[keep,]
dim(tmp)

## [1] 2902   30
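To see the fold-change filter in isolation, here is a sketch on three made-up genes (`toy` and its values are hypothetical). Note that max/min is only a meaningful fold change when every value is strictly positive, so the sketch checks that first:

```r
# Toy matrix (hypothetical data): rows are genes, columns are samples
toy <- rbind(flat     = c(100, 120, 150, 200),    # max/min = 2
             variable = c(100, 300, 900, 1200),   # max/min = 12
             edge     = c(100, 200, 400, 500))    # max/min = 5, not > 5

# The ratio is only a fold change for strictly positive values
stopifnot(all(toy > 0))

ratios <- apply(toy, 1, max) / apply(toy, 1, min)
rownames(toy)[ratios > 5]                         # only "variable" is kept
```

A gene whose ratio is exactly 5 is removed, because the comparison is strict.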

golub.filtered <- tmp
rm(tmp)

5.1.3 Extracting specific measurements

# First, read in the file and convert the contents into a named vector
golub100.gnames.tmp <- read.table(file.path(DATADIR, "golub100.txt"),
                                  stringsAsFactors = FALSE)
golub100.gnames <- golub100.gnames.tmp[,2]
names(golub100.gnames) <- golub100.gnames.tmp[,1]

# From the log normalised matrix, extract the 100 rows of interest
golub.100 <- golub.norm[names(golub100.gnames),]
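Indexing a matrix by row name stops with a "subscript out of bounds" error if any requested name is missing, so it can be worth checking the names first. The sketch below uses made-up identifiers; `gnames.tmp`, `toy.norm` and their contents are hypothetical stand-ins for the table read from golub100.txt and for `golub.norm`:

```r
# Hypothetical stand-in for the two-column table: column 1 holds the row
# identifiers, column 2 the labels attached to them
gnames.tmp <- data.frame(V1 = c("p1", "p3", "p9"),
                         V2 = c("GENE_A", "GENE_C", "GENE_X"),
                         stringsAsFactors = FALSE)

# setNames builds the named vector in one step
gnames <- setNames(gnames.tmp[, 2], gnames.tmp[, 1])

# Toy matrix with the identifiers as row names
toy.norm <- matrix(1:8, nrow = 4,
                   dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2")))

# toy.norm[names(gnames), ] would fail because p9 is not a row name,
# so keep only the identifiers that are present
present    <- names(gnames) %in% rownames(toy.norm)
toy.subset <- toy.norm[names(gnames)[present], , drop = FALSE]
rownames(toy.subset)                 # p1 and p3
gnames[!present]                     # identifiers that could not be matched
```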