Differentially expressed gene analysis of RNA-seq data using R

2.9 Differential expression analysis

2.9.1 Fitting the model

To fit the model, use the lmFit() function, which takes in the normalised data object and the design matrix:

fit <- lmFit(y,design)

2.9.2 Comparisons

To have a look at the pairwise comparison(s) of interest, we may need to create a constrast matrix. This is usually always the case when there are multiple factors being compared. To do this, we use the makeConstrasts() function:

cont.matrix <- makeContrasts(liver-brain, levels=design)
cont.matrix

##        Contrasts
## Levels  liver - brain
##   brain            -1
##   liver             1

Refit the model using the comparisons defined:

fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
options(digits = 3)

dim(fit2)

## [1] 15050     1

2.9.3 Extract top DE features

The topTable function summarises the output from limma in a table format. Significant DE genes for a particular comparison can be identified by selecting genes with a p-value smaller than a chosen cut-off value and/or a fold change greater than a chosen value in this table. By default the table will be sorted by increasing adjusted p-value, showing the most significant DE genes at the top.

Set the threshold values:

THRES_PVAL <- 0.01
THRES_FC <- 3

colnames(fit$coefficients)

## [1] "brain" "liver"

Get the output table for the 10 most significant DE genes for this comparison: \(liver vs brain\).

comparison='liver - brain'
topTable(fit2, coef=comparison)

##          logFC   AveExpr         t      P.Value    adj.P.Val        B
## 3240 10.818058  7.853145 100.71027 5.603924e-14 7.444180e-10 19.18750
## 5265 10.152815  7.487355  93.96364 9.892598e-14 7.444180e-10 19.01571
## 213  11.221081 10.966981  71.22374 9.577714e-13 2.059208e-09 18.47777
## 2335  6.797037  8.578822  67.73067 1.445862e-12 2.439635e-09 18.46460
## 338  11.093595  8.521278  71.76248 9.004441e-13 2.059208e-09 18.29472
## 2243 11.101339  7.402822  79.82534 3.763856e-13 1.461517e-09 18.09575
## 2244 11.177598  7.179453  75.38332 6.016490e-13 1.810963e-09 17.71017
## 229  10.583832  6.610364  79.51871 3.884431e-13 1.461517e-09 17.65905
## 125  10.915521  7.395653  67.65636 1.458918e-12 2.439635e-09 17.57556
## 716   7.466281  6.769933  59.39239 4.238953e-12 4.253083e-09 17.54615

Get the full table (n is the number of genes in the fit):

limma.result <- topTable(fit2, coef=comparison, n=nrow(fit2))

## Get significant DEGs only (adjusted p-value < THRES_PVAL)
limma.sigP.DEG <- topTable(fit2, coef=comparison, n=nrow(fit2), p.val=THRES_PVAL)
dim(limma.sigP.DEG)

## [1] 8190    6

head(limma.sigP.DEG,10)

##          logFC   AveExpr         t      P.Value    adj.P.Val        B
## 3240 10.818058  7.853145 100.71027 5.603924e-14 7.444180e-10 19.18750
## 5265 10.152815  7.487355  93.96364 9.892598e-14 7.444180e-10 19.01571
## 213  11.221081 10.966981  71.22374 9.577714e-13 2.059208e-09 18.47777
## 2335  6.797037  8.578822  67.73067 1.445862e-12 2.439635e-09 18.46460
## 338  11.093595  8.521278  71.76248 9.004441e-13 2.059208e-09 18.29472
## 2243 11.101339  7.402822  79.82534 3.763856e-13 1.461517e-09 18.09575
## 2244 11.177598  7.179453  75.38332 6.016490e-13 1.810963e-09 17.71017
## 229  10.583832  6.610364  79.51871 3.884431e-13 1.461517e-09 17.65905
## 125  10.915521  7.395653  67.65636 1.458918e-12 2.439635e-09 17.57556
## 716   7.466281  6.769933  59.39239 4.238953e-12 4.253083e-09 17.54615

Get significant DEG with low adjusted p-values and high fold change

limma.sigFC.DEG <- subset(limma.sigP.DEG, logFC > THRES_FC)
dim(limma.sigFC.DEG)

## [1] 1291    6

head(limma.sigFC.DEG,10)

##          logFC   AveExpr         t      P.Value    adj.P.Val        B
## 3240 10.818058  7.853145 100.71027 5.603924e-14 7.444180e-10 19.18750
## 5265 10.152815  7.487355  93.96364 9.892598e-14 7.444180e-10 19.01571
## 213  11.221081 10.966981  71.22374 9.577714e-13 2.059208e-09 18.47777
## 2335  6.797037  8.578822  67.73067 1.445862e-12 2.439635e-09 18.46460
## 338  11.093595  8.521278  71.76248 9.004441e-13 2.059208e-09 18.29472
## 2243 11.101339  7.402822  79.82534 3.763856e-13 1.461517e-09 18.09575
## 2244 11.177598  7.179453  75.38332 6.016490e-13 1.810963e-09 17.71017
## 229  10.583832  6.610364  79.51871 3.884431e-13 1.461517e-09 17.65905
## 125  10.915521  7.395653  67.65636 1.458918e-12 2.439635e-09 17.57556
## 716   7.466281  6.769933  59.39239 4.238953e-12 4.253083e-09 17.54615

Write the limma output table for significant genes to a tab-delimited file:

filename <- paste0("DEG_",comparison,
                   "_pval_",THRES_PVAL,
                   "_logFC_", THRES_FC, ".txt");
write.table(limma.sigFC.DEG, file=file.path(RESULTS_DIR,filename),
            row.names=T, quote=F, sep="\t")
filename

## [1] "DEG_liver - brain_pval_0.01_logFC_3.txt"

Get the number of DE genes between technical group 1 and technical group 2 (all Brain samples) with adj pvalue<0.01

Create a new design matrix for limma with the technical replicate groups
Re-normalise the read counts with ‘voom’ function with new design matrix
Fit a linear model on these normalised data
Make the contrast matrix corresponding to the new set of parameters
Fit the contrast matrix to the linear model
Compute moderated t-statistics of differential expression
Get the output table for the 10 most significant DE genes for this comparison
Get the number of genes significantly DE (adjusted p-value < 0.01)