5.2 Remove uninformative measurements

The Golub dataset that we have been using so far contains around 7,000 measurements (gene intensities). Much of this data may be redundant or uninformative, so in this section we will explore ways of filtering out low-value information from a dataset.

During our transformation stage, we applied a floor and ceiling value to our measurements, because values outside of this range are considered unreliable. If all the measurements for a gene are below the floor value, or all are above the ceiling, we can conclude that the expression value of that gene is uniformly outside the useful range of measurement, and ignore the data for that gene. Also, since this is a differential expression analysis, we are not interested in genes that show low levels of variability across the samples, so we will remove them too.

  1. Use the apply() function to generate a new matrix which excludes genes with an intensity of 100 or less across all samples (use the dataset from which you have removed the outliers).

Hint: break the problem down. First find the maximum intensity for each gene, then identify the genes for which that maximum is more than 100, and finally slice just those rows into a new matrix.

  2. Next, filter this new matrix to remove all genes with an intensity of 16,000 or more for all samples.

  3. Finally, filter the matrix once more to remove all genes with less than a five-fold change in signal across the samples.
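
The three filters above can be sketched on a small toy matrix before trying them on the real data. This is only an illustration, not a model answer: the matrix `toy` and all variable names are invented here, the values are made up, and the fold change is assumed to mean the ratio of a gene's maximum to its minimum intensity across samples.

```r
# Toy stand-in for the expression matrix: rows are genes, columns are samples.
# The values are invented and already floored at 100 and capped at 16,000.
toy <- matrix(
  c(  100,   100,   100,   100,   # never rises above the floor
    16000, 16000, 16000, 16000,   # at the ceiling in every sample
      400,   500,   450,   600,   # in range, but less than five-fold change
      120,   900,  3000,   250),  # in range and variable
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4))
)

# Step 1: per-gene maximum (MARGIN = 1 means "by row");
# keep genes whose maximum is more than 100.
gene_max <- apply(toy, 1, max)
above_floor <- toy[gene_max > 100, , drop = FALSE]

# Step 2: per-gene minimum; keep genes whose minimum is below 16,000,
# i.e. drop genes that are at 16,000 or more in every sample.
gene_min <- apply(above_floor, 1, min)
below_ceiling <- above_floor[gene_min < 16000, , drop = FALSE]

# Step 3: fold change, taken here as max / min (an assumption);
# keep genes with at least a five-fold change.
fold_change <- apply(below_ceiling, 1, function(x) max(x) / min(x))
filtered <- below_ceiling[fold_change >= 5, , drop = FALSE]

rownames(filtered)
```

In this toy case only gene4 survives all three filters. The `drop = FALSE` argument keeps the result as a matrix even when a single row is left, which avoids surprises if a later step calls apply() on it again.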