R Final project package: Introducing muMotif
Welcome!
Greetings and welcome to the write up of muMotif package.
Github link to muMotif: https://github.com/Ant-nguyen/RmuMotif
This was a final project for our R programming course, the goal of this package is to provide some tools that can help one look up motifs in DNA sequences. Motifs can be thought as patterns or hidden messages that standout in DNA. Motifs can be useful for narrowing down on regions of interest in large DNA sequences.
Much of the logic and principles of motif search were inspired by:
Throughout this blog I hope to explain the concepts and though processes behind the package, if you find some of these concepts interesting or if my explanation is lacking I highly recommend the link above.
Let us begin!
Quick Tour:
MuMotif is a package of functions that help allow one to find motifs in strings of DNA. The package is also set up to allow for quick visualization of some of the results.
There are two broad sets of functions:
1. Functions that help generalize and describe DNA sequences and/or motifs.
2. Functions that run quick ("greedy")algorithms to produce motifs from sequences of DNA.
The visualization of this package all rely on the packages ggplot2 and the add-on ggseqlogo.
For simplification all visualization are perform by methods off the general plot() function. Products from certain function will automatically classify the object with the appropriate Class attribute to allow the plot() function to automatically visualize the results if needed. This will further be explored in examples of these methods but here is a sneak peak of the visualization that can be produced:
Example of results from using plot() on an object with the
profile class attribute.
1.DosR_Dataset: These are 10 genes that express significant changes during hypoxic condition in tuberculosis.
2.simpleSampleMotif: A vector of 5 DNA motifs that are all 12 sequences long.
simpleSampleMotif is small and manageable enough to help showcase how the function works while
DosR_Dataset is intended to show how the package might work with real DNA sequences.
Functions
The functions will be ordered in levels of complexity. Some functions heavily rely on output of other functions, hopefully by ordering it this way less confusion will occur.
patternCount(patt,txt)
patternCount <-function(patt,txt){
pattlocation <-gregexpr(patt,txt)[[1]] # int vector of where each pattern starts
return(length(pattlocation))
}
Simply returns the number of times a pattern (patt)occurs in a text string (txt).
Example:
> patternCount("AGC","GTGCAGCTAGC")
[1] 2
> patternCount("G","GTGCAGCTAGC")
[1] 4
frequencyMap(k,txt)
frequencyMap <- function(k,txt){ # k is an integer value representing a length of sequence (kmer)
freq <- list()
n <- nchar(txt)-2
for(i in 1:n){
x <- (k+i-1)
pattern <- substr(txt,i,x)
freq[[pattern]] <- patternCount(pattern,txt)
}
class(freq) <-"frequencyMap" #classify end result to frequencyMap, allowing specific plotting
return(freq)
}
This function returns a list of all the snippets of a text string (txt) of a specific length (k).
These snippets are often called kmers, for instance kmers that are 3 nucleotides long are known as 3-mers and those that are 9 are known as 9-mers.
The function relies on the patternCount function to return how often these kmers appear.
This function also highlights the first of the unique classes "frequencyMap". Running plot() on result of this function will produce a bar graph of the kmers.
Example:
> frequencyMap(3,"GTGCAGCTAGC")
$GTG
[1] 1
$TGC
[1] 1
$GCA
[1] 1
$CAG
[1] 1
$AGC
[1] 2
$GCT
[1] 1
$CTA
[1] 1
$TAG
[1] 1
attr(,"class")
[1] "frequencyMap"
plot.frequencyMap(frequencyMap)
plot.frequencyMap <- function(frequencyMap){
library(ggplot2)
datas <- unlist(frequencyMap)
dfmap <- data.frame(datas)
ggplot(dfmap,aes(rownames(dfmap),datas))+geom_col()+xlab("kmer sequence")+ylab("Frequency")
}
This method function function simply takes the frequency map that was generated from frequencyMap() and return a bar graph like this:
The function uses ggplot2 so it requires importing of that function.
Example:
> m <- frequencyMap(3,"GTGCAGCTAGC")
> plot(m)
muCount(Motifs,pseudo=FALSE)
muCount <- function(Motifs,pseudo=FALSE){
k <-nchar(Motifs[1])
count <- genematrix(k,pseudo = pseudo)
for(i in Motifs){
nuc <- strsplit(i,"")[[1]]
for(j in 1:k){
count[nuc[j],j]<- count[nuc[j],j]+1
}
}
return(count)
}
This function takes a vector of motifs and return a matrix of the count of each nucleotide at each position. The pseudo argument can be turned on to true in order to use a concept known as pseudo counts. What this does is start all counts with 1 instead of 0. This will be useful later on when calculating sequence probability and scoring.
Also this function uses a simple small function that generates matrices in a format for the sequences.
genematrix()the function takes k and pseudo=T/F. K being the size of the matrix(sequence length)
Example:
> simpleSampleMotif
[1] "GGCGTTCAGGCA" "AAGAATCAGTCA" "CAAGGAGTTCGC" "CACGTCAATCAC" "CAATAATATTCG"
> muCount(simpleSampleMotif)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
A 1 4 2 1 2 2 1 4 0 0 1 2
C 3 0 2 0 0 1 2 0 0 2 3 2
G 1 1 1 3 1 0 1 0 2 1 1 1
T 0 0 0 1 2 2 1 1 3 2 0 0
> muCount(simpleSampleMotif,pseudo = TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
A 2 5 3 2 3 3 2 5 1 1 2 3
C 4 1 3 1 1 2 3 1 1 3 4 3
G 2 2 2 4 2 1 2 1 3 2 2 2
T 1 1 1 2 3 3 2 2 4 3 1 1
muProfile(Motifs,pseudo=FALSE)
muProfile <- function(Motifs,pseudo=FALSE){
if (pseudo== FALSE){
counts <- muCount(Motifs)
n <- length(Motifs)
}
else{
counts <- muCount(Motifs,pseudo=TRUE)
n <- length(Motifs)+4
}
profile <- (counts/n)
class(profile) <- "profile" #classification to allow for specific plot method
return(profile)
}
muProfile turns the muCount into ratio, again pseudo function is available in order to use pseudo counts if needed.
The function also assign a the class "profile", this allow for plot method plot.profile().
Example:
> muProfile(simpleSampleMotif)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
A 0.2 0.8 0.4 0.2 0.4 0.4 0.2 0.8 0.0 0.0 0.2 0.4
C 0.6 0.0 0.4 0.0 0.0 0.2 0.4 0.0 0.0 0.4 0.6 0.4
G 0.2 0.2 0.2 0.6 0.2 0.0 0.2 0.0 0.4 0.2 0.2 0.2
T 0.0 0.0 0.0 0.2 0.4 0.4 0.2 0.2 0.6 0.4 0.0 0.0
attr(,"class")
[1] "profile"
> muProfile(simpleSampleMotif,pseudo = TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A 0.2222222 0.5555556 0.3333333 0.2222222 0.3333333 0.3333333 0.2222222 0.5555556 0.1111111 0.1111111
C 0.4444444 0.1111111 0.3333333 0.1111111 0.1111111 0.2222222 0.3333333 0.1111111 0.1111111 0.3333333
G 0.2222222 0.2222222 0.2222222 0.4444444 0.2222222 0.1111111 0.2222222 0.1111111 0.3333333 0.2222222
T 0.1111111 0.1111111 0.1111111 0.2222222 0.3333333 0.3333333 0.2222222 0.2222222 0.4444444 0.3333333
[,11] [,12]
A 0.2222222 0.3333333
C 0.4444444 0.3333333
G 0.2222222 0.2222222
T 0.1111111 0.1111111
attr(,"class")
[1] "profile"
plot.profile(profile,prob=TRUE)
plot.profile <- function(profile,prob=TRUE){
bit <- ggseqlogo(profile)
pro <- ggseqlogo(profile,method = 'prob')
if (prob == TRUE){
return(pro)
}
else{
return(bit)
}
}
Another plot method, this time for objects classified as "profile".
As long as the object is a profile, plot() function will automatically run the method producing sequence logo plots.
prob argument if left TRUE will plot the seqlogo to scale based on probability, if FALSE will scale based using a weight system.
This method heavily rely on the ggseqlogo package found here:
Examples:
consensus(Motifs)
consensus <- function(Motifs){
profile <- muCount(Motifs)
conStrand <- vector(mode='character')
for (i in 1:nchar(Motifs[[1]])) {
conStrand <- c(conStrand,names(which.max(profile[,i])))
}
return(conStrand)
}
Given motifs will produce the consensus at each position.
Example:
> consensus(simpleSampleMotif)
[1] "C" "A" "A" "G" "A" "A" "C" "A" "T" "C" "C" "A"
muScore(Motifs)
muScore <- function(Motifs){
count <- muCount(Motifs)
score <- sum(count)
for (i in 1:nchar(Motifs[[1]])) {
score<- (score - max(count[,i]))
}
return(score)
}
A scoring function, given motifs will compare the motifs with the consensus. Gives a score based on how well the consensus of the motifs compare to each motif. The lower the score, the more the consensus of the group of motifs compare to one another.
Example:
> muScore(simpleSampleMotif)
[1] 28
pr(txt,profile)
pr<- function(txt,profile){
p <- 1
txtstr <- strsplit(txt,"")[[1]]
for(i in 1:length(txtstr)){
p <- profile[txtstr[i],i]*p
}
return(as.numeric(p))
}
Simple function that takes a sequence of DNA (txt) and based on the profile provided return a probability value of the sequence happening based on the profile.
*Note: based on how the probability is calculated this is where pseudo counts are useful. Since the probability is calculated by multiplying the nucleotide ratios(profile). Any sequence that has a single nucleotide in a position that never occurs will have the same probability of a sequence that has many non-occurring sequence 0% . With pseudo counts we can differentiate the sequences that have a nucleotide in position that never occur in a profile from one another.
Example:
> pr("AACGTA",sampleProfile)
[1] 0.006144
profileMostProbable(txt,k,profile)
profileMostProbable <- function(txt,k,profile){
num <- nchar(txt)
bestscore <- -1
for(i in 1:(num-k+1)){
x <- substr(txt,i,(i+k-1))
if(pr(x,profile) > bestscore){
bestscore <- pr(x,profile)
mostprob <- x
}
}
return(mostprob)
}
This function will produce the most probable kmer based of a sequence based on a profile provided.
The algorithm is simple so it will only take the first best kmer that is produced.
Examples:
> profileMostProbable("AACGTA",3,sampleProfile)
[1] "AAC"
GreedyMotifSearch(DNAseqs,k,t,pseudo=FALSE)
GreedyMotifSearch <- function(DNAseqs,k,t,pseudo=FALSE){
BestMotifs <- c()
for(i in DNAseqs){
BestMotifs <-c(BestMotifs,substr(i,1,k))
}
n <- nchar(DNAseqs[1])
for(j in 1:(n-k+1)){
Motifs <- substr(DNAseqs[1],j,(j+k-1))
for(m in 2:t){
p <- muProfile(Motifs,pseudo=pseudo)
Motifs <- c(Motifs,profileMostProbable(DNAseqs[m],k,p))
}
if(muScore(Motifs)< muScore(BestMotifs)){
BestMotifs <- Motifs
}
}
return(BestMotifs)
}
Finally the "big" algorithm of the package. The GreedyMotifSearch essentially Generates a vector of motifs from a series of DNAseqs. The size of the motifs can be changed with k and the number of DNAseqs used can be manipulated with t if all the DNAseqs are not wanted for whatever reason.
Note that this algorithm is very simple and therefore has some inherent biases. But due to its simplicity it is fast and does a decent enough job to at least getting the ball rolling when searching for Motifs.
A decent breakdown of the function can be found here if interested:
Example:
> GreedyMotifSearch(DosR_Dataset,15,10)
[1] "GTTAGGGCCGGAAGT" "CCGATCGGCATCACT" "ACCGTCGATGTGCCC"
[4] "GGGTCAGGTATATTT" "GTGACCGACGTCCCC" "CTGTTCGCCGGCAGC"
[7] "CTGTTCGATATCACC" "GTACATGTCCAGAGC" "GCGATAGGTGAGATT"
[10] "CTCATCGCTGTCATC"
> GreedyMotifSearch(DosR_Dataset,15,10,pseudo = TRUE)
[1] "GGACTTCAGGCCCTA" "GGTCAAACGACCCTA" "GGACGTAAGTCCCTA" "GGATTACCGACCGCA"
[5] "GGCCGAACGACCCTA" "GGACCTTCGGCCCCA" "GGACTTCTGTCCCTA" "GGACTTTCGGCCCTG"
[9] "GGACTAACGGCCCTC" "GGACCGAAGTCCCCG"
Conclusions
I hope this blog help those viewing muMotif.
Thank you for taking the time to check out the package.
There are many things I hope to add and change with the package, if interested feel free to contact or message me on issues or ideas about the package.
-Anthony Nguyen
Comments
Post a Comment