Module 8 : I/O, string manipulation and plyr package

This week we will be exploring the input and output (I/O) capabilities of R, as well as the tools of the plyr package.
For this assignment we are given three different tasks:
1. We must import a text file that contains a data set with the Name, Age, Sex and Grade of 20 students and then, using the plyr package, group the data by sex and add a new column with the average grade for each student's sex.
2. We need to output that data in CSV (Comma Separated Values) format.
3. We will take the original data set and extract only the students who have the character "i" in their name, and then output the results to a file in CSV format.

First we need to input the text file into R. To do this I used the read.table() function, which reads files that are already in a table-like format into a data frame. I used the arguments header = TRUE and sep = "," since the text file has a header row and its values are separated by commas. Instead of hard-coding a file path, I passed file.choose() as the file argument, so R will prompt us to choose a file to open manually. This can be useful when creating a package, since it lets users open files from anywhere and not just specific locations.
I stored the result in a variable named "rawfile":
>rawfile <- read.table(file.choose(), header = TRUE, sep = ",")
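
Since file.choose() only works in an interactive session, here is a small sketch of the same read.table() call run on inline text instead of a file. The four students below are made up for illustration and are not the assignment's data set; the name `students` is also just a placeholder.

```r
# Made-up sample in the same Name/Age/Sex/Grade layout (not the real data)
sample_text <- "Name,Age,Sex,Grade
Raul,25,Male,80
Booker,18,Male,83
Lauri,21,Female,90
Leonie,21,Female,91"

# text = lets read.table() parse a string instead of opening a file
students <- read.table(text = sample_text, header = TRUE, sep = ",")

# Shows the column names and the type R chose for each column
str(students)
```

The same header and sep arguments apply whether the input comes from text = or from a file path.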

Now if we call rawfile we can see that the text file we originally had is now imported into R as a data frame, neat!
We were introduced to plyr this week, an R package that builds on the split-apply-combine idea behind base R's apply() family of functions, but with a more consistent naming scheme and more robust tools for grouped operations.
But before we can use plyr we need to install it into R:
>install.packages("plyr")
>library(plyr)
The first use of plyr in our code is ddply(). The first d means that the input is a data frame, and the second d means that we want the output to be a data frame as well. This naming makes it easy to look at any plyr function and know immediately what the input is and what output to expect!
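
To see the naming scheme in action, here is a minimal sketch comparing ddply() with dlply() (data frame in, list out). The data frame `df` is made up for illustration, and it assumes plyr is already installed.

```r
library(plyr)

# Made-up grades (illustration only)
df <- data.frame(Sex = c("Female", "Female", "Male", "Male"),
                 Grade = c(90, 80, 70, 60))

# dd: data frame in, data frame out (one row per group here)
as_df <- ddply(df, "Sex", summarise, Grade.Average = mean(Grade))

# dl: data frame in, list out (one element per group)
as_list <- dlply(df, "Sex", function(piece) mean(piece$Grade))

is.data.frame(as_df)  # TRUE
is.list(as_list)      # TRUE
```

Both calls compute the same group means; only the container for the result changes.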
Our goal is to split our data by sex into male and female, and calculate the average grade for each sex. We can do this with the following command:
>y <- ddply(rawfile, "Sex", transform, Grade.Average = mean(Grade))
Here we are telling ddply to use the rawfile data set we imported, to split it by Sex, and to "transform" each piece (modify the existing data frame) by adding a column that holds the average grade for that sex.
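
As a quick check of what transform does here, the sketch below runs the same kind of call on a tiny made-up data frame and compares it with a base R alternative, ave(). The names `df` and `with_avg` are placeholders, not part of the assignment.

```r
library(plyr)

# Made-up students (illustration only)
df <- data.frame(Name = c("Ana", "Ben", "Cy", "Di"),
                 Sex = c("Female", "Male", "Male", "Female"),
                 Grade = c(90, 70, 60, 80))

# transform keeps every row and adds the group mean as a new column;
# the rows come back grouped by Sex
with_avg <- ddply(df, "Sex", transform, Grade.Average = mean(Grade))

# Base R alternative: ave() fills in the group mean in the original row order
df$Grade.Average <- ave(df$Grade, df$Sex)

print(with_avg)
```

Every student keeps their own row, and students of the same sex share the same Grade.Average value.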

It is nice that we have this information in R, but it is important for us to be able to export it as well. We can do this with the write.table() function, the table-writing counterpart to read.table().
>write.table(y,"Sorted_Avg",sep = ",")

We are creating a file named Sorted_Avg and we indicate that we want the values separated with a comma. This is important since it is what makes our output file comma-separated (CSV): R puts the commas between the values for us.
Note that since the file name has no extension, R will produce a plain file with no file type. If we include .txt at the end of the name, the file will be created as a .txt file instead; the contents are the same either way. I created mine as a .txt file since that is the file type we started with.
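
One detail worth knowing: write.table() also writes the row names as an extra first column by default. The sketch below is one way to leave them out and check the result by reading the file back. The data frame `scores` is made up, and tempfile() is used only so the example does not write into the working directory.

```r
# Made-up data frame standing in for the grouped result
scores <- data.frame(Name = c("Ana", "Ben"), Grade = c(90, 70))

# Temporary path with a .csv extension (an assumption for this sketch)
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE drops the 1, 2, ... column write.table() adds by default
write.table(scores, out_file, sep = ",", row.names = FALSE)

# Reading the file back confirms the header and values line up
round_trip <- read.csv(out_file)
print(round_trip)
```

With the row names left out, the header has exactly one entry per column, so read.csv() reads the file straight back.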

Next we were instructed to extract only the students who have the letter "i" in their name. We do this by using the subset() function:
>new1 <-subset(rawfile,grepl("[iI]",rawfile$Name))
Here we tell subset() to look at our original rawfile and keep only the rows for which grepl() returns TRUE. grepl() checks each value against the pattern we give it and returns TRUE where there is a match and FALSE where there is not. In our case we tell grepl() to check whether each name contains an I or an i. It is important to note that we had to include both lowercase and uppercase, since by default R treats them as different characters. We could technically get away with just "i", since none of the students have a name starting with the letter I and therefore there is no instance of a capital I, but it's better to be safe than sorry, especially if we want to reuse this code on other data that may contain the character I. Finally, rawfile$Name tells grepl() to only look at the Name column.
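
To make the TRUE/FALSE behavior concrete, here is a short sketch of grepl() on a made-up vector of names (not the assignment's data). It also shows ignore.case = TRUE, an alternative to listing both cases in the pattern.

```r
# Made-up names (illustration only)
student_names <- c("Lauri", "Raul", "Ingrid", "Booker")

# grepl() returns one TRUE/FALSE per element
has_i <- grepl("[iI]", student_names)
print(has_i)
# TRUE FALSE TRUE FALSE

# ignore.case = TRUE matches both cases without the [iI] character class
same_result <- grepl("i", student_names, ignore.case = TRUE)

# Using the logical vector to keep only the matching names
student_names[has_i]
```

subset() uses that logical vector the same way, keeping the rows where it is TRUE.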

One thing to note is that the numbers next to the names (the row names) are still based on the original data frame. This might look weird and cause some confusion. We can fix this by resetting the row names to run from 1 to the number of rows in the subset.
>rownames(new1) <- seq_len(nrow(new1))
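
Here is a small before-and-after sketch of the row names on a made-up data frame. It resets them by assigning NULL, which is an alternative way to get back to 1, 2, ... and not necessarily the approach used above.

```r
# Made-up students (illustration only)
df <- data.frame(Name = c("Raul", "Lauri", "Booker", "Leonie"),
                 Grade = c(80, 90, 83, 91))

picked <- subset(df, grepl("[iI]", Name))

# Before: row names still point at positions in df
rownames(picked)   # "2" "4"

# Assigning NULL resets the row names to the default 1, 2, ...
rownames(picked) <- NULL

# After
rownames(picked)   # "1" "2"
```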

[Screenshots: row names before and after renumbering]

Finally we take this new subset and write it to a file, just like we did before with our grouped table:
>write.table(new1,"DataSubsetI",sep=",")

Excellent!
As stated by Matloff in our reading, I/O is extremely important and is often overlooked. It was great learning these new skills, and I know I will be reading and writing many different kinds of files on my R journey.

-Anthony 
