1.2 Reading from a fasta file

FASTA format is a text-based format for representing nucleotide and peptide sequences using their single-letter IUPAC codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a standard in the field of bioinformatics.

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools that are built into R. A number of packages make the process of loading a fasta file very much easier.

library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Loading required package: XVector
library(ShortRead)
## Loading required package: BiocParallel
## Loading required package: Rsamtools
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: GenomicAlignments
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
NGS_DIR <- "../data/NGS_sequences"
comt <- readFasta(file.path(NGS_DIR, "comt.fasta"))
print(comt)
## class: ShortRead
## length: 2 reads; width: 2437 28369 cycles
id(comt)
##   A BStringSet instance of length 2
##     width seq
## [1]    41 ENST00000361682 cdna:KNOWN_protein_coding
## [2]    58 22 dna:chromosome chromosome:GRCh37:22:19929130:19957498:1
sread(comt)
##   A DNAStringSet instance of length 2
##     width seq
## [1]  2437 CGGGGACACCCTGGCCACCGCCGCGCGGACA...TACCAATAGTCTTATTTTGGCTTATTTTTAA
## [2] 28369 CGGGGACACCCTGGCCACCGCCGCGCGGACA...TACCAATAGTCTTATTTTGGCTTATTTTTAA
width(comt)
## [1]  2437 28369
length(comt)
## [1] 2

From this code it can be seen that we have created an object of ShortRead type that contains a DNAStringSet containing two DNA sequences. The width() function reports the length of each DNA sequence while the length() function reports the number of DNA sequences in the sequence collection.

The character representation of the sequence remains accessible:

comtStr <- toString(sread(comt[1]))
class(comtStr)
## [1] "character"
substring(comtStr,1,50)
## [1] "CGGGGACACCCTGGCCACCGCCGCGCGGACACCCTCACGAGGACACCCCG"

Find all positions in the sequence with “ATG” codon:

gregexpr("ATG", comtStr)
## [[1]]
##  [1]  383  533  579  650  758  836  921  924  941 1000 1300 1337 1390 1441
## [15] 1705 1823 1888 1934 1944 2163 2247
## attr(,"match.length")
##  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## attr(,"useBytes")
## [1] TRUE
as.vector(gregexpr("ATG", comtStr)[[1]])
##  [1]  383  533  579  650  758  836  921  924  941 1000 1300 1337 1390 1441
## [15] 1705 1823 1888 1934 1944 2163 2247

There are a wide range of different functions that can be applied to the manipulation of individual DNA sequences. We could apply functions that include strsplit, replaceLetterAt, subseq, maskMotif, reverseComplement etc. These methods work well but for single sequences or for a small collection of sequences, for batch jobs, other software might be more suitable.

There are a number of other convenience utilities in R/Bioconductor:

GENETIC_CODE
## TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC 
## "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "*" "W" "L" "L" 
## CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG 
## "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "I" "M" 
## ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC 
## "T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" 
## GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG 
## "A" "A" "D" "D" "E" "E" "G" "G" "G" "G"
IUPAC_CODE_MAP
##      A      C      G      T      M      R      W      S      Y      K 
##    "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT" 
##      V      H      D      B      N 
##  "ACG"  "ACT"  "AGT"  "CGT" "ACGT"