This document is an extension and explanation of the R exercises in the book Genomics with R for Dr. Wiggins’ Introductory Genomics course at Oklahoma State University.
The Genomics with R book contains critical information, explains the statistics used, and provides the exercises.
This document is intended as a companion to the Genomics with R book for students who are R beginners.
These instructions are for RStudio. In this course we are running RStudio through the Cyverse platform.
This section begins with 4.1 Clustering: Grouping samples based on their similarity.
The code in this document uses the package::function() structure, except in the case of base and utils commands. This tells you which package each function comes from, which makes the code much easier to understand and to reproduce.
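As a small illustration of the package::function() structure, the sketch below calls the same function with and without the prefix. The vector x and the object names sd_plain and sd_prefixed are made up for this example and are not part of the book's exercises.

```r
# Example data invented for this illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Both calls run the same function; the prefix only makes the package explicit.
sd_plain <- sd(x)
sd_prefixed <- stats::sd(x)
identical(sd_plain, sd_prefixed)

# To check which package a function comes from, look at its environment.
environmentName(environment(sd))

# The prefix also works for a package that has not been attached with library().
utils::head(letters, 3)
```

The prefix is most useful when two attached packages define a function with the same name, because it removes any doubt about which one runs.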
4.1.1 Distance metrics
In this section we use a data frame of patient gene expression values, but the book does not explain how to create this table. Also, the data frame is called df, which is problematic because there is already a base R function df (the density of the F distribution). We will do the following:
Name the data frame exdf, for ‘(gene) expression data frame.’
Create the data frame with the following code. data.frame is the command that tells R what to do with what you put in the parentheses that follow. ‘Patient’, ‘IRX4’, ‘OCT4’, and ‘PAX6’ are the column names, and the information inside each c() is what will go in the rows of that column.
# inside data.frame(), name the columns with = (using <- here would not set the column names)
exdf <- data.frame(
  Patient = c("Patient1", "Patient2", "Patient3", "Patient4"),
  IRX4 = c(11, 13, 2, 1),
  OCT4 = c(10, 13, 4, 3),
  PAX6 = c(1, 3, 10, 9)
)
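To see why df is a risky name, the sketch below checks what df already is before any data frame has been created. It assumes a fresh R session in which you have not defined your own df; msg is a name made up for this example.

```r
# df exists before we create anything: it is the F density function from stats.
is.function(df)
environmentName(environment(df))

# Treating the function as a data frame by mistake produces an error
# that does not mention a missing data frame at all.
msg <- tryCatch(df$IRX4, error = function(e) conditionMessage(e))
msg
```

With a distinct name such as exdf, forgetting to create the object gives a plain "object not found" error instead, which is much easier to diagnose.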
Look at the data frame.
View(exdf)
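View() opens the table in an RStudio viewer tab. If you prefer to inspect the data frame in the console, the following sketch shows a few alternatives. It re-creates exdf so that it runs on its own.

```r
# Re-create the data frame so this block is self-contained.
exdf <- data.frame(
  Patient = c("Patient1", "Patient2", "Patient3", "Patient4"),
  IRX4 = c(11, 13, 2, 1),
  OCT4 = c(10, 13, 4, 3),
  PAX6 = c(1, 3, 10, 9)
)

utils::str(exdf)   # column types and the first values of each column
dim(exdf)          # number of rows, then number of columns
names(exdf)        # column names
print(exdf)        # the whole table printed in the console
```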
Notice that to the left of “Patient1”, “Patient2”, etc. there are numbers. In R these are the row names, which act as the index of the data frame.
We need to make the Patient designations the index.
exdf2 <- exdf[-1]
row.names(exdf2) <- exdf$Patient
View(exdf2)
Now the patient IDs are the index.
Calculate the distance metrics.
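As an alternative sketch, the row names can also be set when the data frame is created, using the row.names argument of data.frame(). The name exdf_alt is made up for this example; the steps above remain the approach this document follows.

```r
# Build the expression table with the patient IDs as row names in one step.
exdf_alt <- data.frame(
  IRX4 = c(11, 13, 2, 1),
  OCT4 = c(10, 13, 4, 3),
  PAX6 = c(1, 3, 10, 9),
  row.names = c("Patient1", "Patient2", "Patient3", "Patient4")
)

rownames(exdf_alt)

# Once the patient IDs are the index, a value can be looked up by row and column name.
exdf_alt["Patient3", "PAX6"]
```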
stats::dist(exdf2, method = "manhattan")
##          Patient1 Patient2 Patient3
## Patient2        7
## Patient3       24       27
## Patient4       25       28        3
stats::dist(exdf2, method = "euclidean")
##           Patient1  Patient2  Patient3
## Patient2  4.123106
## Patient3 14.071247 15.842980
## Patient4 14.594520 16.733201  1.732051
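To see where these numbers come from, the sketch below calculates the Manhattan and Euclidean distances between Patient1 and Patient2 by hand. The object names p1, p2, manhattan_12, and euclidean_12 are made up for this example.

```r
# Expression values for the first two patients, copied from the data frame.
p1 <- c(IRX4 = 11, OCT4 = 10, PAX6 = 1)   # Patient1
p2 <- c(IRX4 = 13, OCT4 = 13, PAX6 = 3)   # Patient2

# Manhattan distance: add up the absolute differences, 2 + 3 + 2.
manhattan_12 <- sum(abs(p1 - p2))
manhattan_12

# Euclidean distance: square root of the summed squared differences, sqrt(4 + 9 + 4).
euclidean_12 <- sqrt(sum((p1 - p2)^2))
euclidean_12
```

These match the Patient1/Patient2 entries returned by stats::dist(): 7 for Manhattan and about 4.123 for Euclidean.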
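Calculate the correlation distance. cor() correlates columns, so t() transposes the data frame first to put the patients in the columns.
stats::as.dist(1 - stats::cor(t(exdf2)))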
##             Patient1    Patient2    Patient3
## Patient2 0.004129405
## Patient3 1.988522468 1.970725343
## Patient4 1.988522468 1.970725343 0.000000000
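The correlation distance is 1 minus the Pearson correlation, so it ranges from 0 (same expression pattern) to 2 (opposite pattern). The sketch below checks two of the entries by hand; the object names are made up for this example.

```r
# Expression values (IRX4, OCT4, PAX6) copied from the data frame.
p1 <- c(11, 10, 1)   # Patient1
p3 <- c(2, 4, 10)    # Patient3
p4 <- c(1, 3, 9)     # Patient4: every value is Patient3's value minus 1

# Patient1 and Patient3 have nearly opposite patterns, so the distance is close to 2.
cor_dist_13 <- 1 - stats::cor(p1, p3)
cor_dist_13

# Patient3 and Patient4 have the same pattern, so the distance is 0 (up to rounding).
cor_dist_34 <- 1 - stats::cor(p3, p4)
cor_dist_34
```

Note that Patient3 and Patient4 are 0 apart by correlation distance even though their Euclidean distance is not 0: correlation distance compares the shape of the expression profile, not its level.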
4.1.1.1 Scaling before calculating the distance
The book scales the data frame directly. This reduces the number of objects you have floating around in your environment, but sometimes it is worth creating a new object so you don’t overwrite your data frame, in case you need to return to it later. If I were working with these data rather than demonstrating, I would have overwritten exdf above instead of creating exdf2, but I wanted both available for comparison.
scaled_exdf2 <- scale(exdf2)
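With its default settings, scale() subtracts each column's mean and divides by that column's standard deviation, so every gene ends up with mean 0 and standard deviation 1. The sketch below checks this; it re-creates exdf2 so that it runs on its own, and irx4_by_hand is a name made up for this example.

```r
# Re-create the expression table with patient IDs as row names.
exdf2 <- data.frame(
  IRX4 = c(11, 13, 2, 1),
  OCT4 = c(10, 13, 4, 3),
  PAX6 = c(1, 3, 10, 9),
  row.names = c("Patient1", "Patient2", "Patient3", "Patient4")
)
scaled_exdf2 <- scale(exdf2)

# After scaling, every column has mean 0 and standard deviation 1.
round(colMeans(scaled_exdf2), 10)
apply(scaled_exdf2, 2, stats::sd)

# The same calculation by hand for one gene.
irx4_by_hand <- (exdf2$IRX4 - mean(exdf2$IRX4)) / stats::sd(exdf2$IRX4)
all.equal(unname(scaled_exdf2[, "IRX4"]), irx4_by_hand)
```

Note that scale() returns a matrix rather than a data frame, which stats::dist() accepts just as well.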
To cluster the scaled data, first calculate the distance matrix of the scaled data frame, then pass it to hierarchical clustering.
#create an object to hold the scaled distance matrix
d <- stats::dist(scaled_exdf2)
#do hierarchical clustering
hc <- stats::hclust(d, method = "complete")
#plot the results
plot(hc)
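The plot shows the tree, but it does not return the group each patient belongs to. One way to get that, sketched below, is stats::cutree(), which cuts the tree into a chosen number of groups. The choice of k = 2 and the name clusters are assumptions made for this example; the block repeats the earlier steps so that it runs on its own.

```r
# Re-create the data, scale it, and cluster it as above.
exdf2 <- data.frame(
  IRX4 = c(11, 13, 2, 1),
  OCT4 = c(10, 13, 4, 3),
  PAX6 = c(1, 3, 10, 9),
  row.names = c("Patient1", "Patient2", "Patient3", "Patient4")
)
d <- stats::dist(scale(exdf2))
hc <- stats::hclust(d, method = "complete")

# Cut the tree into two groups and list the group number for each patient.
clusters <- stats::cutree(hc, k = 2)
clusters
```

For these data, Patient1 and Patient2 fall in one group and Patient3 and Patient4 in the other, which agrees with the small distances between those pairs in the tables above.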