2.11 Gene Annotation
The annotation of EntrezGene IDs from RNAseq data can be done using the BioMart database which contains many species including Human, Mouse, Zebrafish, Chicken and Rat.
Get the Ensembl annotation for human genome. Since we used the build-in hg19
annotation file that comes with Rsubread, we will specify the database version (GRCh=37
) as a argument in the command.
library(biomaRt)
mart<- useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl",GRCh=37)
Get the Entrez gene IDs from the list of significant DEGs as identified using limma-voom
:
DEG.entrezID <- rownames(limma.sigFC.DEG)
head(DEG.entrezID)
## [1] "3240" "5265" "213" "2335" "338" "2243"
Query the BioMart database to get the gene symbols and description for these genes:
DEG.annot <- getBM(filters= "entrezgene",
attributes= c("entrezgene","external_gene_name","description"),
values= DEG.entrezID,
mart= mart)
dim(DEG.annot)
## [1] 1262 3
head(DEG.annot)
## entrezgene external_gene_name
## 1 10 NAT2
## 2 100128553 CTAGE4
## 3 100128553 CTAGE9
## 4 100128893 GATA6-AS1
## 5 100129046 RP5-1033H22.2
## 6 100130231 LINC00861
## description
## 1 N-acetyltransferase 2 (arylamine N-acetyltransferase) [Source:HGNC Symbol;Acc:7646]
## 2 CTAGE family, member 4 [Source:HGNC Symbol;Acc:24772]
## 3 CTAGE family, member 9 [Source:HGNC Symbol;Acc:37275]
## 4 GATA6 antisense RNA 1 (head to head) [Source:HGNC Symbol;Acc:48840]
## 5
## 6 long intergenic non-protein coding RNA 861 [Source:HGNC Symbol;Acc:45133]
In many cases, several annotations are available per entrez gene ID. This results in duplicate entries in the output table from getBM()
. The simplest way to deal with this issue is to remove duplicates, although they can also be concatenated in some ways.
Once the annotation has been obtained for all DE genes, this table can be merged with the output table from limma for a complete result and an easier interpretation.
DEG.annot.unique <- DEG.annot[-which(duplicated(DEG.annot$entrezgene)),]
dim(DEG.annot.unique)
## [1] 1221 3
head(DEG.annot.unique)
## entrezgene external_gene_name
## 1 10 NAT2
## 2 100128553 CTAGE4
## 4 100128893 GATA6-AS1
## 5 100129046 RP5-1033H22.2
## 6 100130231 LINC00861
## 7 100131733 USP30-AS1
## description
## 1 N-acetyltransferase 2 (arylamine N-acetyltransferase) [Source:HGNC Symbol;Acc:7646]
## 2 CTAGE family, member 4 [Source:HGNC Symbol;Acc:24772]
## 4 GATA6 antisense RNA 1 (head to head) [Source:HGNC Symbol;Acc:48840]
## 5
## 6 long intergenic non-protein coding RNA 861 [Source:HGNC Symbol;Acc:45133]
## 7 USP30 antisense RNA 1 [Source:HGNC Symbol;Acc:40909]
rownames(DEG.annot.unique) <- DEG.annot.unique$entrezgene
entrezGenes.annot <- DEG.annot.unique[DEG.entrezID,]
limma.sigFC.DEG <- cbind(entrezGenes.annot,limma.sigFC.DEG)
head(limma.sigFC.DEG)
## entrezgene external_gene_name
## 3240 3240 HP
## 5265 5265 SERPINA1
## 213 213 ALB
## 2335 2335 FN1
## 338 338 APOB
## 2243 2243 FGA
## description
## 3240 haptoglobin [Source:HGNC Symbol;Acc:5141]
## 5265 serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1 [Source:HGNC Symbol;Acc:8941]
## 213 albumin [Source:HGNC Symbol;Acc:399]
## 2335 fibronectin 1 [Source:HGNC Symbol;Acc:3778]
## 338 apolipoprotein B [Source:HGNC Symbol;Acc:603]
## 2243 fibrinogen alpha chain [Source:HGNC Symbol;Acc:3661]
## logFC AveExpr t P.Value adj.P.Val B
## 3240 10.818058 7.853145 100.71027 5.603924e-14 7.444180e-10 19.18750
## 5265 10.152815 7.487355 93.96364 9.892598e-14 7.444180e-10 19.01571
## 213 11.221081 10.966981 71.22374 9.577714e-13 2.059208e-09 18.47777
## 2335 6.797037 8.578822 67.73067 1.445862e-12 2.439635e-09 18.46460
## 338 11.093595 8.521278 71.76248 9.004441e-13 2.059208e-09 18.29472
## 2243 11.101339 7.402822 79.82534 3.763856e-13 1.461517e-09 18.09575
You can find the list of attributes to extract from the getBM()
function by using the following command: View(listAttributes(mart))
.