Module 7: Object Oriented Programming

This week we will be exploring object oriented programming (OOP).

GitHub Link: https://github.com/Ant-nguyen/Intro_r_2021/blob/main/Module7.R

Our assignment this week is to obtain a data set of any type and determine which generic functions can be applied to it. Then we see how S3 and S4 class structures and OOP paradigms can be used with the data.

This week I decided to challenge myself a bit and use a slightly unconventional type of data: a sequence of genomic data. The goal is to try to use OOP methods when tackling a sequence of DNA.

For those who are unfamiliar with DNA transcription, I will be covering some of the basics throughout, but the following video is an easy-to-understand explanation of some of the concepts I will be dealing with:

Link to DNA data: https://www.ncbi.nlm.nih.gov/nuccore/JX262162

Raw genomic sequence: https://www.ncbi.nlm.nih.gov/nuccore/JX262162.1?report=fasta

Let us begin

First let me begin by explaining what our data is.

The data that I will be using is the complete genome of Human polyomavirus 10 (HPV10). It is 4,939 base pairs long. In essence, the genome is raw data. Within the genome there are sequences of DNA that code for specific proteins of interest; I will treat these as objects in our OOP example.

To input the data into R, I take the raw sequence, which is presented as a string of letters ("agtc"), and save it to a text document named HPV-10.txt. It is important to note that this file should be in R's working directory (check it with getwd()).

Next, I have R read the text file as a string of characters with the function readChar(), and assign that string to the variable HPV10.

readChar() has two important arguments for us: (1) the name of the text file and (2) the number of characters to read. I used file.info(fileName)$size as the second argument. This trick, which I found online (see References), uses a built-in function to return the size of HPV-10.txt in bytes, which for a plain-text file of single-byte characters is the number of characters we want to extract.

These are our first two lines of code:

>fileName <- 'HPV-10.txt' #Text file with the full HPV10 genome

>HPV10 <- readChar(fileName,file.info(fileName)$size)

When we check what the variable HPV10 consists of, we see it is the full genome, exactly what we need.
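One thing worth checking at this step is whether the saved text file contains line breaks or spaces, because readChar() keeps them and they would shift every position used later. The following is only a sketch under assumptions: the file is a small temporary stand-in for HPV-10.txt, and the cleaning rule (keep only a, c, g, t) would also drop ambiguity codes such as n if a sequence had any.

```r
# Hypothetical stand-in for HPV-10.txt: a tiny sequence saved over two lines
fileName <- tempfile(fileext = ".txt")
writeLines(c("agtcag", "tcagtc"), fileName)

raw <- readChar(fileName, file.info(fileName)$size)

# Strip everything that is not a nucleotide letter (line breaks, spaces, digits)
clean <- gsub("[^acgt]", "", tolower(raw))

nchar(raw)    # counts the line-break characters as well
nchar(clean)  # 12, the number of nucleotides
```

If nchar() of the cleaned string matches the published genome length, the positions used with substr() later should line up with the annotation.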

Due to the unconventional nature of our data, some generic functions don't produce the kind of results they do for other data sets. We can still use print() to see our data, and summary() works but doesn't tell us much that is useful. Functions like plot(), for instance, do not produce a usable plot from a single character string.
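To make this concrete, here is a small sketch of what some base functions return for a character string. The sequence is a toy example rather than the real genome, and splitting the string into single letters with strsplit() is my own suggestion for getting something more informative out of table().

```r
dna <- "agtcagtcaa"   # toy sequence, not the real genome

summary(dna)          # only reports length, class and mode
nchar(dna)            # 10 nucleotides

# Split into single letters so that table() can count each base
bases <- strsplit(dna, "")[[1]]
counts <- table(bases)
counts

# Fraction of g and c bases
gc_content <- sum(bases %in% c("g", "c")) / length(bases)
gc_content            # 0.4
```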

One function that I found to be very useful when dealing with our data is substr(). It extracts a substring from a larger string of characters, given the start and stop positions of the range we want. This is useful for us because it lets us take the full HPV10 genome and focus on specific regions where proteins of interest are coded.

Here is an example:

In the full HPV10 genome there is a region that codes for a minor capsid protein between nucleotides 443 and 1375.

We can view this specific region by using the substr() function:

>substr(HPV10,443,1375)

Awesome!
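For anyone following along without the genome file, the same call can be tried on a toy string. The positions are inclusive at both ends, which also gives a quick way to work out how long the capsid region should be.

```r
toy <- "agtcagtc"          # toy sequence for illustration
substr(toy, 3, 6)          # "tcag": positions 3, 4, 5 and 6

# Length of a region when both the start and stop positions are included
start_pos <- 443
stop_pos <- 1375
region_length <- stop_pos - start_pos + 1
region_length              # 933 nucleotides
region_length %% 3         # 0, so the region divides evenly into codons
```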

Using S3 and S4 in our data.

Now that we have all of that out of the way, I will begin to explore how we can use OOP to organize the data.

As hinted earlier, I want us to think of the specific sequences of genetic data as objects. The class of an object will tell us what type of genetic data is being presented. For this to make the most sense I would recommend watching the video posted above, but I will attempt to break down what I'm trying to do to the bare essentials.

Essentially, DNA has two strands that run in anti-parallel directions (one going 5' to 3', the other 3' to 5'). DNA is conventionally presented in the 5' to 3' direction; this is known as the coding strand, while the 3' to 5' strand is known as the template strand. Each strand tells us different things, and by organizing them we can extract specific information or create dynamic functions that return different information depending on the class of the strand.

The scope of this can get very expansive, so to keep it simple for now I will make just two classes, one for the coding strand and one for the template strand. Then I will make two different print methods based on the class. If the class is coding strand, the print function will label the 5' and the 3' ends accordingly, and vice versa for the template strand.

We could expand even further and have classes for mRNA and tRNA, but again I want to keep the scope reasonable for this assignment and present a proof of concept rather than a full-fledged product.

(For instance if an object is a template strand, a function that would give us the resulting amino-acid chain of the protein would have to use different logic than for the coding strand.)

S3 classes

First we will use the S3 class paradigm.

Throughout this example we will be using the minor capsid protein sequence as our object of interest.

In our class I want to showcase a few different attributes.

1. The DNA sequence

2. The length of the sequence

3. The direction (is it 5'-3' or 3'-5'?)

Here is our code:

>capsid <- list(DNAseq= substr(HPV10,443,1375),size = nchar(substr(HPV10,443,1375)),Fiveto3 = TRUE)

>class(capsid) <-"coding"

The function nchar() counts the number of characters in a string.

Now when we examine the object capsid, it will tell us the DNA sequence, its length and the direction of the strand.
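A few base R functions are handy for examining an S3 object like this one. In the sketch below a short toy sequence stands in for substr(HPV10,443,1375), so that it runs without the genome file.

```r
# Toy sequence standing in for the capsid region
capsid <- list(DNAseq = "atggct", size = nchar("atggct"), Fiveto3 = TRUE)
class(capsid) <- "coding"

class(capsid)               # "coding"
inherits(capsid, "coding")  # TRUE
capsid$size                 # 6
names(unclass(capsid))      # "DNAseq" "size" "Fiveto3"
```

unclass() strips the class label and shows that underneath the object is still an ordinary list.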

I decided it would be better to use constructor functions to make objects of our classes, since that takes fewer lines of code to enter and reduces possible input errors:

>Coding <- function(DNA){

cod <- list(DNAseq = DNA, size = nchar(DNA), Fiveto3 = TRUE)

class(cod) <- "coding"

return(cod)

}

>Template <- function(DNA){

temp <- list(DNAseq = DNA, size = nchar(DNA), Fiveto3 = FALSE)

class(temp) <- "template"

return(temp)

}

These constructors are useful because we can now quickly mark a strand as Coding or Template with a single function call.
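A constructor is also a natural place to catch bad input before an object is created. The version below is only a sketch: the name CodingChecked and the rule that a sequence may contain nothing but a, c, g and t are my own assumptions, not part of the assignment code.

```r
# Hypothetical variant of Coding() that validates its input first
CodingChecked <- function(DNA) {
  # Assumed rule: a single string made up only of a, c, g, t (either case)
  stopifnot(is.character(DNA), length(DNA) == 1)
  if (!grepl("^[acgtACGT]+$", DNA)) {
    stop("DNA must contain only the letters a, c, g and t")
  }
  structure(list(DNAseq = DNA, size = nchar(DNA), Fiveto3 = TRUE),
            class = "coding")
}

ok <- CodingChecked("atggct")
ok$size    # 6

# An invalid letter stops the object from being created
bad <- tryCatch(CodingChecked("atgxct"),
                error = function(e) conditionMessage(e))
bad        # the error message from stop()
```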

The class organization allows us to wield the data more easily. We can simply use the variable capsid to represent the sequence of genetic code.

When we print an object, I want the output to show the direction of the DNA. Also, although our raw data is all lowercase, I feel it would be better to represent the nucleotides with uppercase letters, since that is the standard representation and other DNA programs often use uppercase lettering.

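>print.coding <- function(x, ...){ #same arguments as the print() generic

cat("5'",toupper(x$DNAseq),"3'\n")

invisible(x) #print methods conventionally return their input invisibly

}

>print.template <- function(x, ...){

cat("3'",toupper(x$DNAseq),"5'\n")

invisible(x)

}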

Now when we print objects, the sequence is shown in all capitals thanks to the toupper() function, along with the orientation of the sequence (5'-3' or 3'-5'), depending on the class of the genetic sequence.
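The same dispatch mechanism is not limited to print(). As a sketch of the class-dependent logic mentioned earlier, the hypothetical generic codingSeq() below returns the 5'-3' coding sequence for either kind of strand. It assumes a template object stores its sequence as written 3'-5', so each base pairs position by position with the coding strand and no reversal is needed.

```r
# Compact versions of the two constructors, repeated so the sketch runs alone
Coding <- function(DNA) {
  structure(list(DNAseq = DNA, size = nchar(DNA), Fiveto3 = TRUE), class = "coding")
}
Template <- function(DNA) {
  structure(list(DNAseq = DNA, size = nchar(DNA), Fiveto3 = FALSE), class = "template")
}

# Hypothetical generic: the method is chosen from the class of x
codingSeq <- function(x, ...) UseMethod("codingSeq")

# A coding strand already is the answer
codingSeq.coding <- function(x, ...) x$DNAseq

# A template strand (assumed stored 3'-5') is complemented base by base
codingSeq.template <- function(x, ...) chartr("acgt", "tgca", x$DNAseq)

codingSeq(Coding("atggct"))     # "atggct"
codingSeq(Template("taccga"))   # "atggct"
```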

Example:

We created capsid earlier as an object with the "coding" strand class.

So when we simply call capsid we get:

As you can see the beginning and end are labeled and all the characters are capitalized.
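The same behaviour can be reproduced on a toy sequence, which is short enough to show the whole output. This is a sketch with a six-letter stand-in for the capsid region, and the print method is repeated so the block runs on its own.

```r
print.coding <- function(x, ...) {
  cat("5'", toupper(x$DNAseq), "3'\n")
  invisible(x)
}

# Toy object with the "coding" class
capsid <- structure(list(DNAseq = "atggct", size = 6, Fiveto3 = TRUE),
                    class = "coding")

capsid    # typing the name calls print.coding(): 5' ATGGCT 3'
```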

S4 classes

When dealing with S4 classes, there are a few glaring differences from S3.

For one, creating classes is more "formal". In S3 we take an object, assign it a class, and set its attributes separately in a more modular manner. In S4 we first have to define a class and all of its attributes (its representation) with setClass(), and then create an object of that class with new(). The advantage of this method is that, since the class and its attributes are predefined, it is less prone to the errors that the open nature of S3 can lead to. With S4, an object can't end up missing an attribute or carrying a misspelled one, and we can't accidentally create whole new classes.

So I begin by creating my two classes using the S4-specific function setClass():

>setClass("coding",representation(DNAseq = "character",size = "numeric",Fiveto3 ="logical"))

>setClass("template",representation(DNAseq = "character",size = "numeric",Fiveto3 ="logical"))
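Once a class is defined, objects are created with new(). The sketch below uses a toy sequence and shows that a value of the wrong type is rejected at creation time. The tryCatch() wrapper is only there so the example keeps running after the error.

```r
library(methods)

setClass("coding", representation(DNAseq = "character", size = "numeric", Fiveto3 = "logical"))

seq1 <- "atggct"   # toy sequence
capsid4 <- new("coding", DNAseq = seq1, size = nchar(seq1), Fiveto3 = TRUE)
capsid4@size       # 6

# A number in the character slot DNAseq makes new() fail
res <- tryCatch(new("coding", DNAseq = 12345, size = 6, Fiveto3 = TRUE),
                error = function(e) "rejected")
res                # "rejected"
```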

We can see that S4 also has us declare the type of the data stored in each attribute ahead of time. This makes it more rigid and less prone to error, unlike S3, where the data type is simply whatever we happen to store.

We can still check the base type of an object, or of any individual attribute we are interested in, by running a function like typeof(). typeof() is also a great way of telling which class system an object is using: on an S4 object it returns "S4", while our S3 objects are reported as "list". We can then use attributes() to see which class the object belongs to.
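A side-by-side check makes the difference easy to see. In this sketch the S4 class is given the hypothetical name codingS4 so that it does not clash with the S3 class of the same idea, and isS4() is added as a direct yes/no test.

```r
library(methods)

# S3 object: a list with a class attribute
s3obj <- structure(list(DNAseq = "atggct"), class = "coding")

# S4 object: hypothetical class name to keep the two apart
setClass("codingS4", representation(DNAseq = "character"))
s4obj <- new("codingS4", DNAseq = "atggct")

typeof(s3obj)             # "list"
typeof(s4obj)             # "S4"
isS4(s3obj)               # FALSE
isS4(s4obj)               # TRUE
attributes(s3obj)$class   # "coding"
attributes(s4obj)$class   # "codingS4", plus a note of where it was defined
```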

It is important to note that S4 uses different notation than S3. For instance, to access the attributes (slots) of an object we use '@' instead of the usual '$'. Another difference is that the function that presents the contents of an S4 object is show(), not the usual print(). We have to keep this in mind when we want to customize a method. To change the default presentation of objects with the "coding" strand class, we have to add a method to show(), not print():

>setMethod("show","coding",function(object){ cat("5\'",toupper(object@DNAseq),"3\'")})
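Putting the pieces together on a toy sequence gives the following sketch. The class definition is repeated so the block runs on its own, and a newline is added to the cat() call so the console prompt starts on a fresh line afterwards.

```r
library(methods)

setClass("coding", representation(DNAseq = "character", size = "numeric", Fiveto3 = "logical"))

setMethod("show", "coding", function(object) {
  cat("5'", toupper(object@DNAseq), "3'\n")
})

capsid4 <- new("coding", DNAseq = "atggct", size = 6, Fiveto3 = TRUE)

capsid4@DNAseq   # "atggct": slots are read with @
capsid4          # typing the name calls show(): 5' ATGGCT 3'
```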

Conclusion

Hopefully my post wasn't too confusing. I was hoping to challenge myself this time by thinking more about the specific field I'm hoping to use R in. I can see how OOP can lead me to think in more organized ways about how to approach a specific goal. I feel that the security of an S4 class system has a lot of benefits: it can prevent many possible errors and gives a less error-prone user experience. Depending on how modular a package is intended to be, an S3 system may be easier to manipulate. It also seems that by using a constructor function one can minimize many of the errors S3 is inherently prone to. Also, S4 can dispatch on multiple arguments! I can already see this being very useful. For instance, instead of my current organization into coding strand and template strand, I could organize by DNA and RNA and have an attribute that indicates whether a strand is coding or template. This would definitely reduce some of the redundancy, since a lot of the general functions I would want for the two are the same.
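As a rough sketch of that idea, the code below defines DNA and RNA classes, stores the strand type in a slot, and dispatches on the classes of two arguments at once. Everything here is hypothetical: the class names, the strand slot and the describePair() generic are placeholders for illustration, not code from the assignment.

```r
library(methods)

# Hypothetical classes: the strand type is a slot rather than a separate class
setClass("DNA", representation(seq = "character", strand = "character"))
setClass("RNA", representation(seq = "character"))

# Hypothetical generic that dispatches on the classes of both arguments
setGeneric("describePair", function(x, y) standardGeneric("describePair"))

setMethod("describePair", signature("DNA", "DNA"),
          function(x, y) "DNA-DNA pair")
setMethod("describePair", signature("DNA", "RNA"),
          function(x, y) "DNA-RNA pair")

d <- new("DNA", seq = "atggct", strand = "coding")
r <- new("RNA", seq = "auggcu")

describePair(d, d)   # "DNA-DNA pair"
describePair(d, r)   # "DNA-RNA pair"
```

An S3 generic would only look at the class of the first argument, so the two calls above could not be told apart without extra checks inside the method.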

These are exciting ideas I hope to further explore.

Anthony

References:

1. https://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string

2. https://stackoverflow.com/questions/6450803/class-in-r-s3-vs-s4#:~:text=S3%20can%20only%20dispatch%20on,can%20dispatch%20on%20multiple%20arguments.

R-programing Journey 2021