Data preparation, processing and interpretation with R

5.1 Removing samples

Our statistical colleagues have now performed some tests on the Golub data and have warned us that there are some outliers in the dataset. These are characterised by unusually low or high mean intensity values, and by unusually large or small standard deviations. Being statisticians who like a joke, though, they haven’t told us which samples are affected. To be careful, we will remove the eight most extreme samples: the two with the highest and the two with the lowest mean intensity, and likewise the two with the highest and the two with the lowest standard deviation.

Some functions that you might find useful in this exercise are apply, colMeans, order, which.min and which.max.
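
As a rough illustration of how these functions behave, the sketch below applies them to a small made-up matrix rather than to the Golub data. The object toy, the gene names g1 to g4 and the sample names s1 to s5 are all hypothetical, and the sketch assumes that samples are stored in columns, as the colMeans hint suggests.

```r
# Toy matrix (made-up values): 4 genes in rows, 5 samples in columns.
# matrix() fills column by column, so each group of four numbers is one sample.
toy <- matrix(c( 1,  2,  3,  4,
                10, 10, 10, 10,
                 5,  6,  7,  8,
                 0,  0,  1,  1,
                 2,  9,  2,  9),
              nrow = 4,
              dimnames = list(paste0("g", 1:4), paste0("s", 1:5)))

sample_means <- colMeans(toy)        # one mean per column (sample)
sample_sds   <- apply(toy, 2, sd)    # MARGIN = 2 applies sd() to each column

which.max(sample_means)              # s2 (column 2) has the highest mean
which.min(sample_sds)                # s2 (column 2) also has the lowest sd

# order() returns the indices that would sort a vector; with
# decreasing = TRUE the first two indices are the two largest values.
top2_mean <- order(sample_means, decreasing = TRUE)[1:2]
colnames(toy)[top2_mean]             # "s2" "s3"
```

Note that which.min and which.max return only a single position, so order is the more convenient choice when the two most extreme samples are needed.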

Using the full Golub matrix (before log transformation and normalisation), identify and remove:

(a) The two samples with the highest mean
(b) The two samples with the lowest mean
(c) The two samples with the highest standard deviation
(d) The two samples with the lowest standard deviation

Warning: When performing such data transformations, be careful about how you perform the steps.

Do you perform each filtration one after the other? That is, filter (a), then filter (b), then filter (c) and then filter (d).

Or do you first find the answers to (a), (b), (c) and (d), and only then perform the filtration in one go?

Try both approaches above and compare your results. Hint: After each step (a) to (d), look at which samples now have the two highest or lowest means and standard deviations.
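
To see why the two approaches can disagree, it may help to try them first on a small made-up matrix before turning to the Golub data. The sketch below is only an illustration under stated assumptions: the data are invented, samples are assumed to be in columns, and the helper functions extreme and drop_samples are hypothetical names introduced here, not part of any package.

```r
# Toy data (made up): 2 genes x 12 samples. Each sample is built from a
# chosen mean m and half-range d, so its mean is m and its sd grows with d.
m <- c(100, 90, 80, 10, 20, 50, 51, 52, 53, 54, 55, 56)
d <- c( 50,  5,  6,  7,  8, 40, 30,  1,  2,  3,  9, 10)
toy <- rbind(g1 = m - d, g2 = m + d)
colnames(toy) <- paste0("s", seq_along(m))

# Hypothetical helper: names of the n columns with the highest (or lowest)
# value of the per-column statistic `stat` (for example mean or sd).
extreme <- function(x, stat, n = 2, highest = TRUE) {
  values <- apply(x, 2, stat)
  names(values)[order(values, decreasing = highest)[seq_len(n)]]
}

# Hypothetical helper: drop the named columns, keeping x a matrix.
drop_samples <- function(x, ids) {
  x[, !colnames(x) %in% ids, drop = FALSE]
}

# Approach 1: filter one step after the other, recomputing each time.
seq_data <- drop_samples(toy,      extreme(toy,      mean, highest = TRUE))   # (a)
seq_data <- drop_samples(seq_data, extreme(seq_data, mean, highest = FALSE))  # (b)
seq_data <- drop_samples(seq_data, extreme(seq_data, sd,   highest = TRUE))   # (c)
seq_data <- drop_samples(seq_data, extreme(seq_data, sd,   highest = FALSE))  # (d)

# Approach 2: find all four answers on the full matrix, then filter once.
flagged <- unique(c(extreme(toy, mean, highest = TRUE),
                    extreme(toy, mean, highest = FALSE),
                    extreme(toy, sd,   highest = TRUE),
                    extreme(toy, sd,   highest = FALSE)))
one_go_data <- drop_samples(toy, flagged)

colnames(seq_data)      # "s3" "s10" "s11" "s12"
colnames(one_go_data)   # "s3" "s7" "s10" "s11" "s12"
```

In this toy example, sample s1 has both the highest mean and the highest standard deviation. The one-go approach therefore flags only seven distinct samples, whereas the step-by-step approach removes eight, because by step (c) s1 is already gone and s7 moves up into the top two.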