2.2 Mapping
Once the reads have been quality checked and trimmed, the next step is to map the reads to the reference genome (in our case the human genome “hg19”). This can be done with the Bioconductor package Rsubread
.
Before mapping the reads to the reference genome you will need to build a Rsubread index for that genome. Below are the commands for building an index for the human reference genome using the buildindex
command.
PLEASE DO NOT RUN the buildindex()
code in the workshop as this can take awhile. We have already build the index for you.
library(Rsubread)
buildindex(basename=file.path(RSUBREAD_INDEX_BASE, RSUBREAD_INDEX_PATH), reference=REF_GENOME)
Once the Rsubread index has been created you can map your reads to the genome by running the align
command. The code below could be used to map the reads for a specific library against the “hg19” genome.
library(Rsubread)
sample <- "ERR420386"
inputFileFWD <- file.path(RNASeq_DATA_DIR, paste0(sample,"_chr21_R1.fastq.gz"))
inputFileRVS <- file.path(RNASeq_DATA_DIR, paste0(sample,"_chr21_R2.fastq.gz"))
output.bamFile <- file.path(MAPPING_DIR, paste0(sample,".bam"))
inputFileFWD
inputFileRVS
output.bamFile
## [1] "../data/RNAseq/raw_data/ERR420386_chr21_R1.fastq.gz"
## [1] "../data/RNAseq/raw_data/ERR420386_chr21_R2.fastq.gz"
## [1] "results/RNAseq/mapping/ERR420386.bam"
For the purpose of this workshop the mapping has already been done. This step can take up to a couple of hours per library.
Please only run the following command using the subset sample *_chr21_R1.fastq.gz
, which is much smaller.
align(index = file.path(RSUBREAD_INDEX_PATH, RSUBREAD_INDEX_BASE),
readfile1 = inputFileFWD,
readfile2 = inputFileRVS,
input_format = "gzFASTQ",
output_file = output.bamFile,
output_format = "BAM")
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.24.2
##
## //========================== subread-align setting ===========================\\
## || ||
## || Function : Read alignment (RNA-Seq) ||
## || Input file 1 : ../data/RNAseq/raw_data/ERR420386_chr21_R1.fastq.gz ||
## || Input file 2 : ../data/RNAseq/raw_data/ERR420386_chr21_R2.fastq.gz ||
## || Output file : results/RNAseq/mapping/ERR420386.bam (BAM) ||
## || Index name : ../data/RNAseq/ref_data/RsubreadIndex/hg19 ||
## || ||
## || ------------------------------------ ||
## || ||
## || Threads : 1 ||
## || Phred offset : 33 ||
## || # of extracted subreads : 10 ||
## || Min read1 vote : 3 ||
## || Min read2 vote : 1 ||
## || Max fragment size : 600 ||
## || Min fragment size : 50 ||
## || Maximum allowed mismatches : 3 ||
## || Maximum allowed indel bases : 5 ||
## || # of best alignments reported : 1 ||
## || Unique mapping : yes ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //================ Running (12-Apr-2017 20:38:48, pid=31533) =================\\
## || ||
## || The input file contains base space reads. ||
## || The range of Phred scores observed in the data is [2,40] ||
## || Load the 1-th index block... ||
## || 4% completed, 0.2 mins elapsed, rate=4.1k fragments per second ||
## || 10% completed, 0.2 mins elapsed, rate=4.1k fragments per second ||
## || 17% completed, 0.3 mins elapsed, rate=4.1k fragments per second ||
## || 24% completed, 0.3 mins elapsed, rate=4.1k fragments per second ||
## || 30% completed, 0.4 mins elapsed, rate=4.1k fragments per second ||
## || 37% completed, 0.4 mins elapsed, rate=4.1k fragments per second ||
## || 44% completed, 0.5 mins elapsed, rate=4.0k fragments per second ||
## || 50% completed, 0.5 mins elapsed, rate=4.0k fragments per second ||
## || 57% completed, 0.5 mins elapsed, rate=4.0k fragments per second ||
## || 64% completed, 0.6 mins elapsed, rate=4.0k fragments per second ||
## || 69% completed, 0.6 mins elapsed, rate=3.0k fragments per second ||
## || 73% completed, 0.6 mins elapsed, rate=3.1k fragments per second ||
## || 76% completed, 0.7 mins elapsed, rate=3.2k fragments per second ||
## || 80% completed, 0.7 mins elapsed, rate=3.2k fragments per second ||
## || 83% completed, 0.7 mins elapsed, rate=3.3k fragments per second ||
## || 86% completed, 0.7 mins elapsed, rate=3.4k fragments per second ||
## || 90% completed, 0.7 mins elapsed, rate=3.5k fragments per second ||
## || 93% completed, 0.7 mins elapsed, rate=3.5k fragments per second ||
## || 96% completed, 0.7 mins elapsed, rate=3.6k fragments per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\============================================================================//
##
## //================================= Summary ==================================\\
## || ||
## || Processed : 161,297 fragments ||
## || Mapped : 160,518 fragments (99.5%) ||
## || Correctly paired : 126,376 fragments ||
## || Indels : 2,635 ||
## || ||
## || Running time : 0.7 minutes ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
The nthreads
parameter can be used in the align command to speed up the process and run the alignment using several CPUs in parallel.
The function propmapped
returns the proportion of mapped reads in the output SAM file: total number of input reads, number of mapped reads and proportion of mapped reads.
propmapped(output.bamFile)
## The input file is opened as a BAM file.
## The fragments in the input file are being counted.
## Finished. All records: 322594; all fragments: 161297; mapped fragments: 160518; the mappability is 99.52%
## Samples NumTotal NumMapped PropMapped
## 1 results/RNAseq/mapping/ERR420386.bam 161297 160518 0.99517
You can run the propmapped()
on multiple bam
files to return a summary of the total number of reads per file and the number of reads that were mappable or unmappable. However, this can take a very long time to run for big bam files.
PLEASE DO NOT RUN
For example:
all.bam.files <- grep(‘.bam’,dir(‘../data/RNAseq/mapping’,full.names = T),value=T)
pm <- propmapped(all.bam.files)
2.2.1 Examining the mapped reads
Create a BamFile
object and load the file into memory so we can interact with it and find out some information. The seqinfo()
function outputs the headding information, in this exercise, this is the
library(Rsamtools)
bf <- BamFile(output.bamFile)
seqinfo(bf)
## Seqinfo object with 25 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## chrM 16571 <NA> <NA>
## chr1 249250621 <NA> <NA>
## chr2 243199373 <NA> <NA>
## chr3 198022430 <NA> <NA>
## chr4 191154276 <NA> <NA>
## ... ... ... ...
## chr20 63025520 <NA> <NA>
## chr21 48129895 <NA> <NA>
## chr22 51304566 <NA> <NA>
## chrX 155270560 <NA> <NA>
## chrY 59373566 <NA> <NA>
We can take a closer look and find out how many of the reads map to each chromosome. To do this, we need to first sort and index the bam
file.
output.sorted.bamFile <- file.path(MAPPING_DIR,paste0(sample, '.sorted'))
sortBam(output.bamFile, output.sorted.bamFile)
## [1] "results/RNAseq/mapping/ERR420386.sorted.bam"
output.sorted.bamFile <- paste0(output.sorted.bamFile, ".bam")
indexBam(output.sorted.bamFile)
## results/RNAseq/mapping/ERR420386.sorted.bam
## "results/RNAseq/mapping/ERR420386.sorted.bam.bai"
dir(MAPPING_DIR, full.names=T)
## [1] "results/RNAseq/mapping/ERR420386.bam"
## [2] "results/RNAseq/mapping/ERR420386.bam.indel"
## [3] "results/RNAseq/mapping/ERR420386.sorted.bam"
## [4] "results/RNAseq/mapping/ERR420386.sorted.bam.bai"
output.bam.index <- dir(MAPPING_DIR, full.names=T)[grep(".bai",dir(MAPPING_DIR))]
output.bam.index
## [1] "results/RNAseq/mapping/ERR420386.sorted.bam.bai"
Once the index bam file has been created, we can find out the number of mapped reads per chromosome:
chr.mapping.stats <- idxstatsBam(output.bamFile, index=output.bam.index)
chr.mapping.stats
## seqnames seqlength mapped unmapped
## 1 chr1 249250621 150 0
## 2 chr2 243199373 559 0
## 3 chr3 198022430 20 0
## 4 chr4 191154276 79 0
## 5 chr5 180915260 892 0
## 6 chr6 171115067 200 0
## 7 chr7 159138663 850 0
## 8 chr8 146364022 7 0
## 9 chr9 141213431 101 0
## 10 chr10 135534747 10 0
## 11 chr11 135006516 212 0
## 12 chr12 133851895 155 0
## 13 chr13 115169878 42 0
## 14 chr14 107349540 19 0
## 15 chr15 102531392 56 0
## 16 chr16 90354753 924 0
## 17 chr17 81195210 6 0
## 18 chr18 78077248 27 0
## 19 chr19 59128983 24 0
## 20 chr20 63025520 20 0
## 21 chr21 48129895 314047 0
## 22 chr22 51304566 15 0
## 23 chrX 155270560 46 0
## 24 chrY 59373566 66 0
## 25 chrM 16571 0 0
This is easiest to view as a plot:
rownames(chr.mapping.stats) <- chr.mapping.stats$seqnames
barplot(chr.mapping.stats$mapped,
names.arg=as.character(chr.mapping.stats$seqnames))
Mapping percentage
Using a barplot, can you find out which other chromsome has the highest number of mapped reads?
Hint: repeat the barplot but without the bar for chr21
.
Solution
total.mapped.reads <- sum(chr.mapping.stats\(mapped)</code><br /> <code>chr.mapping.stats\)mapped.prop <- chr.mapping.stats\(mapped/total.mapped.reads*100</code><br /> <code>barplot(chr.mapping.stats\)mapped.prop[-21],
names.arg=chr.mapping.stats$seqnames[-21],
las=2, col=‘slateblue’, ylim=c(0,1))