2.12 Gene set enrichment

Gene Ontology (GO) enrichment is a method for investigating sets of genes using the Gene Ontology system classification, in which genes are assigned to a particular set of terms for three major domains: cellular component, biological process and molecular function.

The GOstats package can test for both over and under representation of GO terms using the standard hypergeometric test. The output of the analysis is typically a ranked list of GO terms, each associated with a p-value.

The hypergeometric test will require both a list of selected genes (e.g. your DE genes) and a “universe” list (e.g. all genes represented that were tested for differential expression), all represented by their “EntrezGene” ID.

Get the list of universe entrez IDs:

library(GOstats)
## Loading required package: Category
## Loading required package: AnnotationDbi
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:MASS':
## 
##     select
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## Loading required package: graph
## 
## Attaching package: 'graph'
## The following object is masked from 'package:Biostrings':
## 
##     complement
## 
## 
## Attaching package: 'GOstats'
## The following object is masked from 'package:AnnotationDbi':
## 
##     makeGOGraph
universe.entrezID <- rownames(filtered.raw.counts)
length(universe.entrezID)
## [1] 15050

Before running the hypergeometric test with the hyperGTest function, you need to define the parameters for the test (gene lists, ontology, test direction) as well as the annotation database to be used. The ontology to be tested for can be any of the three GO domains: biological process (“BP”), cellular component (“CC”) or molecular function (“MF”).

In the example below, we will test for over-represented Biological Processes in our list of differentially expressed genes.

annotationDB <- "org.Hs.eg.db"
hgCutoff <- 0.05

params <- new("GOHyperGParams",
              geneIds=DEG.entrezID,
              universeGeneIds=universe.entrezID,
              annotation=annotationDB,
              ontology="BP",
              pvalueCutoff=hgCutoff,
              testDirection="over")
# Run the test
hg <- hyperGTest(params)
hg
## [1] 2010    7
##        GOBPID       Pvalue OddsRatio  ExpCount Count Size
## 1  GO:0006082 6.074927e-66  4.964910  73.47815   236  820
## 2  GO:0043436 1.146564e-58  4.766285  68.28092   216  762
## 3  GO:0019752 9.451120e-58  4.729580  67.92249   214  758
## 4  GO:0044281 1.899138e-54  3.336444 141.75907   326 1582
## 5  GO:0032787 9.530910e-54  5.891811  42.20513   159  471
## 6  GO:0006952 2.581186e-53  3.968852  88.26339   242  985
## 7  GO:0044710 5.216264e-51  2.692125 289.34263   509 3229
## 8  GO:0006954 1.371656e-48  5.643188  40.14416   148  448
## 9  GO:0006955 2.441550e-43  3.477192  91.04122   229 1016
## 10 GO:0008202 1.761494e-41  8.038462  20.16169    95  225
##                                     Term
## 1         organic acid metabolic process
## 2              oxoacid metabolic process
## 3      carboxylic acid metabolic process
## 4       small molecule metabolic process
## 5  monocarboxylic acid metabolic process
## 6                       defense response
## 7      single-organism metabolic process
## 8                  inflammatory response
## 9                        immune response
## 10             steroid metabolic process

We need to adjust for multiple testing using the p.adjust() function. You can specify the type of adjustment method to use, we are using bonferroni in this example. Assign the adjusted pvalues back to the hg.df data object. Reorder the columns so that the unadjusted and adjusted p-values are next to each other.

hg.df <- summary(hg)
hg.df$Adj.Pvalue <- p.adjust(hg.df$Pvalue, 'bonferroni')
hg.df <- hg.df[,c(1:2,8,3:7)]
head(hg.df,10)
##        GOBPID       Pvalue   Adj.Pvalue OddsRatio  ExpCount Count Size
## 1  GO:0006082 6.074927e-66 1.221060e-62  4.964910  73.47815   236  820
## 2  GO:0043436 1.146564e-58 2.304594e-55  4.766285  68.28092   216  762
## 3  GO:0019752 9.451120e-58 1.899675e-54  4.729580  67.92249   214  758
## 4  GO:0044281 1.899138e-54 3.817267e-51  3.336444 141.75907   326 1582
## 5  GO:0032787 9.530910e-54 1.915713e-50  5.891811  42.20513   159  471
## 6  GO:0006952 2.581186e-53 5.188184e-50  3.968852  88.26339   242  985
## 7  GO:0044710 5.216264e-51 1.048469e-47  2.692125 289.34263   509 3229
## 8  GO:0006954 1.371656e-48 2.757028e-45  5.643188  40.14416   148  448
## 9  GO:0006955 2.441550e-43 4.907515e-40  3.477192  91.04122   229 1016
## 10 GO:0008202 1.761494e-41 3.540603e-38  8.038462  20.16169    95  225
##                                     Term
## 1         organic acid metabolic process
## 2              oxoacid metabolic process
## 3      carboxylic acid metabolic process
## 4       small molecule metabolic process
## 5  monocarboxylic acid metabolic process
## 6                       defense response
## 7      single-organism metabolic process
## 8                  inflammatory response
## 9                        immune response
## 10             steroid metabolic process

Compare before and after multiple adjustment:

par(mfrow=c(1,2))
plot(density(hg.df$Pvalue),'Unadjust p-values')
abline(v=hgCutoff,col='red',lty=2)

plot(density(hg.df$Adj.Pvalue),'Adjusted p-values')
abline(v=hgCutoff,col='red',lty=2)

Keep only the significant GO terms after adjusting for multiple testing:

sigGO.table <- subset(hg.df, Adj.Pvalue < hgCutoff)
dim(sigGO.table)
## [1] 517   8
head(sigGO.table,10)
##        GOBPID       Pvalue   Adj.Pvalue OddsRatio  ExpCount Count Size
## 1  GO:0006082 6.074927e-66 1.221060e-62  4.964910  73.47815   236  820
## 2  GO:0043436 1.146564e-58 2.304594e-55  4.766285  68.28092   216  762
## 3  GO:0019752 9.451120e-58 1.899675e-54  4.729580  67.92249   214  758
## 4  GO:0044281 1.899138e-54 3.817267e-51  3.336444 141.75907   326 1582
## 5  GO:0032787 9.530910e-54 1.915713e-50  5.891811  42.20513   159  471
## 6  GO:0006952 2.581186e-53 5.188184e-50  3.968852  88.26339   242  985
## 7  GO:0044710 5.216264e-51 1.048469e-47  2.692125 289.34263   509 3229
## 8  GO:0006954 1.371656e-48 2.757028e-45  5.643188  40.14416   148  448
## 9  GO:0006955 2.441550e-43 4.907515e-40  3.477192  91.04122   229 1016
## 10 GO:0008202 1.761494e-41 3.540603e-38  8.038462  20.16169    95  225
##                                     Term
## 1         organic acid metabolic process
## 2              oxoacid metabolic process
## 3      carboxylic acid metabolic process
## 4       small molecule metabolic process
## 5  monocarboxylic acid metabolic process
## 6                       defense response
## 7      single-organism metabolic process
## 8                  inflammatory response
## 9                        immune response
## 10             steroid metabolic process

Other software can be used to investigate over-represented pathways, such as GeneGO https://portal.genego.com/ and Ingenuity http://www.ingenuity.com/products/ ipa. The advantage of these applications is that they maintain curated and up-to-date extensive databases. They also provide intuitive visualisation and network modelling tools.


Save an image of your RNAseq analysis.

save.image(file=file.path(RESULTS_DIR,"RNAseq.Rdata"))

SessionInfo

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] biomaRt_2.30.0             tidyr_0.6.1               
##  [3] edgeR_3.16.5               limma_3.30.13             
##  [5] mixOmics_6.1.2             ggplot2_2.2.1             
##  [7] lattice_0.20-35            MASS_7.3-45               
##  [9] Rsubread_1.24.2            ShortRead_1.32.1          
## [11] GenomicAlignments_1.10.1   SummarizedExperiment_1.4.0
## [13] Biobase_2.34.0             Rsamtools_1.26.2          
## [15] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3       
## [17] BiocParallel_1.8.2         Biostrings_2.42.1         
## [19] XVector_0.14.1             IRanges_2.8.2             
## [21] S4Vectors_0.12.2           BiocGenerics_0.20.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10         locfit_1.5-9.1       corpcor_1.6.9       
##  [4] assertthat_0.2.0     rprojroot_1.2        digest_0.6.12       
##  [7] mime_0.5             R6_2.2.0             plyr_1.8.4          
## [10] backports_1.0.5      RSQLite_1.1-2        ellipse_0.3-8       
## [13] evaluate_0.10        zlibbioc_1.20.0      lazyeval_0.2.0      
## [16] rstudioapi_0.6       Matrix_1.2-8         rmarkdown_1.4       
## [19] labeling_0.3         stringr_1.2.0        htmlwidgets_0.8     
## [22] igraph_1.0.1         RCurl_1.95-4.8       munsell_0.4.3       
## [25] shiny_1.0.1          httpuv_1.3.3         htmltools_0.3.5     
## [28] tibble_1.3.0         bookdown_0.3         codetools_0.2-15    
## [31] XML_3.98-1.6         dplyr_0.5.0          bitops_1.0-6        
## [34] grid_3.3.3           jsonlite_1.4         xtable_1.8-2        
## [37] gtable_0.2.0         DBI_0.6-1            magrittr_1.5        
## [40] scales_0.4.1         stringi_1.1.5        hwriter_1.3.2       
## [43] reshape2_1.4.2       latticeExtra_0.6-28  RColorBrewer_1.1-2  
## [46] tools_3.3.3          yaml_2.1.14          AnnotationDbi_1.36.2
## [49] colorspace_1.3-2     memoise_1.0.0        knitr_1.15.1