Differentially expressed gene analysis of RNA-seq data using R

2.11 Gene Annotation

The annotation of EntrezGene IDs from RNAseq data can be done using the BioMart database which contains many species including Human, Mouse, Zebrafish, Chicken and Rat.

Get the Ensembl annotation for human genome. Since we used the build-in hg19 annotation file that comes with Rsubread, we will specify the database version (GRCh=37) as a argument in the command.

library(biomaRt)
mart<- useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl",GRCh=37)

Get the Entrez gene IDs from the list of significant DEGs as identified using limma-voom:

DEG.entrezID <- rownames(limma.sigFC.DEG)
head(DEG.entrezID)

## [1] "3240" "5265" "213"  "2335" "338"  "2243"

Query the BioMart database to get the gene symbols and description for these genes:

DEG.annot <- getBM(filters= "entrezgene",
                   attributes= c("entrezgene","external_gene_name","description"),
                   values= DEG.entrezID,
                   mart= mart)

dim(DEG.annot)

## [1] 1262    3

head(DEG.annot)

##   entrezgene external_gene_name
## 1         10               NAT2
## 2  100128553             CTAGE4
## 3  100128553             CTAGE9
## 4  100128893          GATA6-AS1
## 5  100129046      RP5-1033H22.2
## 6  100130231          LINC00861
##                                                                           description
## 1 N-acetyltransferase 2 (arylamine N-acetyltransferase) [Source:HGNC Symbol;Acc:7646]
## 2                               CTAGE family, member 4 [Source:HGNC Symbol;Acc:24772]
## 3                               CTAGE family, member 9 [Source:HGNC Symbol;Acc:37275]
## 4                 GATA6 antisense RNA 1 (head to head) [Source:HGNC Symbol;Acc:48840]
## 5                                                                                    
## 6           long intergenic non-protein coding RNA 861 [Source:HGNC Symbol;Acc:45133]

In many cases, several annotations are available per entrez gene ID. This results in duplicate entries in the output table from getBM(). The simplest way to deal with this issue is to remove duplicates, although they can also be concatenated in some ways.

Once the annotation has been obtained for all DE genes, this table can be merged with the output table from limma for a complete result and an easier interpretation.

DEG.annot.unique <- DEG.annot[-which(duplicated(DEG.annot$entrezgene)),]
dim(DEG.annot.unique)

## [1] 1221    3

head(DEG.annot.unique)

##   entrezgene external_gene_name
## 1         10               NAT2
## 2  100128553             CTAGE4
## 4  100128893          GATA6-AS1
## 5  100129046      RP5-1033H22.2
## 6  100130231          LINC00861
## 7  100131733          USP30-AS1
##                                                                           description
## 1 N-acetyltransferase 2 (arylamine N-acetyltransferase) [Source:HGNC Symbol;Acc:7646]
## 2                               CTAGE family, member 4 [Source:HGNC Symbol;Acc:24772]
## 4                 GATA6 antisense RNA 1 (head to head) [Source:HGNC Symbol;Acc:48840]
## 5                                                                                    
## 6           long intergenic non-protein coding RNA 861 [Source:HGNC Symbol;Acc:45133]
## 7                                USP30 antisense RNA 1 [Source:HGNC Symbol;Acc:40909]

rownames(DEG.annot.unique) <- DEG.annot.unique$entrezgene
entrezGenes.annot <- DEG.annot.unique[DEG.entrezID,]
limma.sigFC.DEG <- cbind(entrezGenes.annot,limma.sigFC.DEG)
head(limma.sigFC.DEG)

##      entrezgene external_gene_name
## 3240       3240                 HP
## 5265       5265           SERPINA1
## 213         213                ALB
## 2335       2335                FN1
## 338         338               APOB
## 2243       2243                FGA
##                                                                                                            description
## 3240                                                                         haptoglobin [Source:HGNC Symbol;Acc:5141]
## 5265 serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1 [Source:HGNC Symbol;Acc:8941]
## 213                                                                               albumin [Source:HGNC Symbol;Acc:399]
## 2335                                                                       fibronectin 1 [Source:HGNC Symbol;Acc:3778]
## 338                                                                      apolipoprotein B [Source:HGNC Symbol;Acc:603]
## 2243                                                              fibrinogen alpha chain [Source:HGNC Symbol;Acc:3661]
##          logFC   AveExpr         t      P.Value    adj.P.Val        B
## 3240 10.818058  7.853145 100.71027 5.603924e-14 7.444180e-10 19.18750
## 5265 10.152815  7.487355  93.96364 9.892598e-14 7.444180e-10 19.01571
## 213  11.221081 10.966981  71.22374 9.577714e-13 2.059208e-09 18.47777
## 2335  6.797037  8.578822  67.73067 1.445862e-12 2.439635e-09 18.46460
## 338  11.093595  8.521278  71.76248 9.004441e-13 2.059208e-09 18.29472
## 2243 11.101339  7.402822  79.82534 3.763856e-13 1.461517e-09 18.09575

You can find the list of attributes to extract from the getBM() function by using the following command: View(listAttributes(mart)).