Chapter 3 Explore Data with R

3.1 Getting Started

We will be using these three datasets for the analysis today.. Please download the following files:

  1. messy_demographic.csv
  2. messy_cognitive.csv
  3. messy_genotype.csv

In this lesson, we are going to start building figures and table from our data.

We are going to do so inside an R - notebook, so we can write ourself a little tutorial/report as what we are finding as we go!!

To write a report, we will make use of R-Markdown syntax. A cheatsheet for R Markdown Syntax is here.

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

We are also going to make heavy use of the “tidyverse” suite of packages. These packages include:

For more info on programming with the tidyverse I highly recommend the online book R for data science by Garrett Grolemund & Hadley Wickham.

3.1.1 The packages for today

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

3.1.2 reading in the data

demo_df <- read_csv("~/Desktop/messy_demographic.csv")
cog_df <- read_csv("~/Desktop/messy_cognitive.csv")
gene_df <- read_csv("~/Desktop/messy_genotype.csv")

3.1.3 copy and paste the cleaning code

These are all the things we learned to do in Intro to R.

We are going to put them in one big chunk here.

library(stringr)

demo_df[demo_df==""] <- NA
demo_df[demo_df=="missing"] <- NA
demo_df[demo_df=="9999"] <- NA
demo_df <- demo_df %>%
  mutate(age = as.numeric(age),
         ethnicity = factor(ethnicity),
         sex = factor(sex, levels = c(0,1), 
                      labels = c("Male", "Female")),
         dx = factor(dx, levels = c(0,1), 
                     labels = c("Control", "Case")))
cog_df[cog_df==""] <- NA
cog_df[cog_df=="missing"] <- NA
cog_df[cog_df=="9999"] <- NA
cog_df <- cog_df %>%
  mutate(cog1 = as.numeric(cog1),
         cog2 = as.numeric(cog2),
         cog3 = as.numeric(cog3),
         subject_ID = str_replace(subID, "subject", "SUB_")) %>%
  select(subject_ID, cog1:cog3)
gene_df[gene_df==""] <- NA
gene_df[gene_df=="missing"] <- NA
gene_df[gene_df=="9999"] <- NA
gene_df <- gene_df %>%
  mutate(genotype = factor(genotype,
                           levels=c(0,1,2), 
                           labels=c("AA","AG","GG")),
         subject_ID = str_replace(subID, "subject", "SUB_")) %>%
  select(-subID)
alldata <- demo_df %>%
  inner_join(cog_df, by="subject_ID") %>%
  inner_join(gene_df, by="subject_ID")

3.1.4 Let’s see what we have here

We can use the summary function (from base R) to get an idea of what is in our dataset.

summary will print some summary statistics for numeric variables are counts for our factors.

summary(alldata)
##   subject_ID             age            sex      ethnicity         dx     
##  Length:350         Min.   :22.00   Male  :130   AA   : 70   Control:166  
##  Class :character   1st Qu.:43.00   Female:216   As   : 42   Case   :178  
##  Mode  :character   Median :49.00   NA's  :  4   Cauc :196   NA's   :  6  
##                     Mean   :50.38                In   : 19                
##                     3rd Qu.:58.00                Other: 14                
##                     Max.   :89.00                NA's :  9                
##                     NA's   :8                                             
##       cog1             cog2            cog3           genotype  
##  Min.   : 2.868   Min.   :20.23   Min.   :  0.05796   AA  :103  
##  1st Qu.: 8.783   1st Qu.:29.22   1st Qu.:  8.82092   AG  :145  
##  Median :10.939   Median :31.36   Median : 19.79222   GG  : 94  
##  Mean   :11.087   Mean   :31.66   Mean   : 25.04099   NA's:  8  
##  3rd Qu.:12.964   3rd Qu.:34.03   3rd Qu.: 35.55916             
##  Max.   :22.189   Max.   :44.74   Max.   :104.21326             
##  NA's   :5        NA's   :5       NA's   :9

3.2 Using tableone to make Table 1 of your paper

Table one is a cool package that creates the demogaphics table.

It takes four useful arguments + data: the data to plot + vars: the variables (from your data) to include in your table + factorVars: a list of which variables (in vars) should be treated as factors + strata: the name of a variable to split the table by.

library(tableone)
CreateTableOne(alldata,
               vars = c("age", "sex",
                        "genotype","ethnicity",
                        "cog1", "cog2", "cog3"),
               factorVars = c("sex", 
                              "genotype","ethnicity"),
               strata = "dx")
##                   Stratified by dx
##                    Control       Case          p      test
##   n                  166           178                    
##   age (mean (SD))  50.42 (11.95) 50.57 (11.06)  0.905     
##   sex = Female (%)   127 (77.4)     86 (48.9)  <0.001     
##   genotype (%)                                  0.371     
##      AA               44 (27.2)     58 (33.1)             
##      AG               68 (42.0)     73 (41.7)             
##      GG               50 (30.9)     44 (25.1)             
##   ethnicity (%)                                 0.523     
##      AA               36 (22.4)     32 (18.4)             
##      As               17 (10.6)     25 (14.4)             
##      Cauc             92 (57.1)    100 (57.5)             
##      In               11 ( 6.8)      8 ( 4.6)             
##      Other             5 ( 3.1)      9 ( 5.2)             
##   cog1 (mean (SD))  9.94 (2.91)  12.13 (3.44)  <0.001     
##   cog2 (mean (SD)) 31.55 (3.88)  31.78 (3.79)   0.573     
##   cog3 (mean (SD)) 25.20 (21.76) 25.25 (20.39)  0.982

3.3 Research Question 1 (two group comparison)

3.3.0.1 Is performance on Cognitive Scale One (cog1) associated with Diagnosis (Dx)

To test this statistically, we are going to run an independant samples t-test, using the base t.test function.

When we call the t.test function, we are going to use “formula” notation. we’re our dependant variable goes on the left side of a ~ and the predictors go to the right i.e. y ~ x.

t.test(cog1 ~ dx, data = alldata)
## 
##  Welch Two Sample t-test
## 
## data:  cog1 by dx
## t = -6.347, df = 334.92, p-value = 7.133e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.871388 -1.512676
## sample estimates:
## mean in group Control    mean in group Case 
##              9.940047             12.132079

One cool thing to remember about everything you do in R is that they generate useful objects we can save.

my_ttest_result <- t.test(cog1 ~ dx, data = alldata)
my_ttest_result$p.value
## [1] 7.13299e-10

3.3.1 Plotting our result

library(ggplot2)

ggplot(data = alldata, aes(x = dx, y = cog1)) +
  geom_boxplot()

#### removing NA rows before plotting

We can use tidyr’s drop_na to remove rows with NA’s in the dx column

alldata %>% 
  drop_na(dx) %>%
  ggplot(aes(x = dx, y = cog1, fill = dx)) +
    geom_boxplot() 

3.3.1.1 Introducing the the dotplot geom

geom_dotplot has to be one of my favs.. Let’s use it in combination with geom_boxplot.

Let’s also use the labs() call to relabel our axes with more descriptive titles.

Note: by setting labs(x = NULL) we are removing the word “dx” from the bottom of the plot

alldata %>% 
  drop_na(dx) %>%
  ggplot(aes(x = dx, y = cog1, fill = dx)) +
    geom_dotplot(binaxis = "y", stackdir = "center") +
    geom_boxplot(alpha = 0.5) +     # using "alpha" to make the box-plot semi-transparent
    labs(x = NULL,                  #remove dx from the bottom
         y = "Cognitive Score 1",   #add more desciptive title to y-axis
         fill = "Diagnosis")        # change dx to Diagnosis in the legend

3.4 pretty tables with dplyr

In the section below we use two powerful tools from dplyr:

  • summarise: will calculate summary statistics, we are going to ask for three stats here:
    • n_age: the number of valid (i.e. not NA) values for age
    • mean_age: the mean for age (after removing NA values using the na.rm argument)
    • sd_age: the standard deviation for age after removing the NA values
  • group_by: will split the data by the “grouping” variable(s) you input
    • in this case we are asking for separate summary statistics for each diagnostic group, for each genotype

After calculating our summary table we are going to call kable, a function in the knitr package. That make the table appear nicer in our “knitted” report.

Note: kable has some kewl optional arguments you can use to format you table further, such as col.names, digits, and align. Check out ?kable for more info..

table_2 <- alldata %>%
  drop_na(dx) %>%
  group_by(dx,genotype) %>%
  summarise(n_age = sum(!is.na(age)), #the total number of observations that are NOT NA
            mean_age = mean(age, na.rm = T),
            sd_age = sd(age, na.rm = T))
## Warning: Factor `genotype` contains implicit NA, consider using
## `forcats::fct_explicit_na`
library(knitr)
kable(table_2)
dx genotype n_age mean_age sd_age
Control AA 43 52.00000 10.574002
Control AG 66 49.92424 11.024882
Control GG 50 49.94000 14.217566
Control NA 4 47.50000 12.476645
Case AA 57 55.08772 10.554756
Case AG 71 48.61972 10.355628
Case GG 43 48.37209 11.176129
Case NA 2 38.00000 9.899495

3.5 use forcats to label risk_carriers

When working with a genotype of interest, it’s very common group participants into “carriers” and “non-carriers” of a risk-allele. In this dataset, our risk allele is the “G” genotype. We will create a new “risk_carrier” variable by combining levels of the genotype variable.

Working with factors in R can be…annoying. Especially when we want to create new factors out of old factors. For this reason, the tidyverse includes a forcats package. That includes many function (all starting with fct_) for doing stuff to factors. We will use fct_collapse to combine the “GG” and “AG” genotype factors into a new “carrier” level in a new factor.

library(forcats)
alldata <- alldata %>%
  mutate(risk_carrier = fct_collapse(genotype,
                                     carrier = c("GG", "AG"),
                                     non_carrier = "AA"))

3.6 Make a simple scatter plot

Let’s use geom_point() and geom_smooth, together to create a scatter plot with a trendline.

Note: to make the trendline straight, we use “method = lm” when calling geom_smooth().

alldata %>%
  ggplot(aes(x = age, y = cog1)) +
    geom_point() + 
    geom_smooth(method = "lm") 

3.7 Use plotting to show the age by risk carrier interaction

When we add a new mapping of color = risk_carrier to the top line of our plot this mapping applies to both the points and the trendline.

Note: we are using drop_na to remove the NA as a category so that it does not occur as a third color in our plot.

alldata %>%
  drop_na(risk_carrier) %>%
  ggplot(aes(x = age, y = cog1, color = risk_carrier)) +
    geom_point() + 
    geom_smooth(method = "lm") 

We can use faceting to add and extra dimension - separate plots for male and female subjects.

alldata %>%
  drop_na(risk_carrier,sex) %>%
  ggplot(aes(x = age, y = cog1, color = risk_carrier)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    facet_wrap(~sex)

facet_grid will allow us to facet by TWO variables. In this case sex and dx.

alldata %>%
  dplyr::filter(!is.na(risk_carrier), !is.na(sex), !is.na(dx)) %>%
  ggplot(aes(x = age, y = cog1, color = risk_carrier)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    facet_grid(dx~sex)

3.8 BONUS section - gather cognitive scores into one plot

One things we want to do sometimes is to stack data from multiple columns in one column. The reason why this is useful, will hopefully become apparent in the next example.

This “stacking” task (also refered to as “melting”) is accomplished using gather from the tidyr package. gather creates two new columns from your data: “key” and “value”. + key: a new column that will hold the old variable names from the gathered data + value: a new column that will hold the data values

After the key and value arguments. We can tell gather what colums we want to stack using the same syntax used by dplyr’s select(). In this example, we tell it we want to gather our three cognitive scales using starts_with().

library(tidyr)
alldata %>% 
  gather(cog_scale, cognitive_score, starts_with("cog"))
## # A tibble: 1,050 x 9
##    subject_ID   age sex   ethnicity dx    genotype risk_carrier cog_scale
##    <chr>      <dbl> <fct> <fct>     <fct> <fct>    <fct>        <chr>    
##  1 SUB_1         43 Male  Cauc      Cont… GG       carrier      cog1     
##  2 SUB_2         47 Fema… Cauc      Case  AG       carrier      cog1     
##  3 SUB_3         69 Fema… Cauc      Case  AA       non_carrier  cog1     
##  4 SUB_4         51 Male  Cauc      Case  GG       carrier      cog1     
##  5 SUB_5         52 Fema… Cauc      Cont… AA       non_carrier  cog1     
##  6 SUB_6         71 Male  AA        Case  AA       non_carrier  cog1     
##  7 SUB_7         56 Fema… Cauc      Case  AA       non_carrier  cog1     
##  8 SUB_8         35 Fema… <NA>      Cont… GG       carrier      cog1     
##  9 SUB_9         42 Fema… Cauc      Cont… AG       carrier      cog1     
## 10 SUB_10        45 Fema… Other     Case  <NA>     <NA>         cog1     
## # … with 1,040 more rows, and 1 more variable: cognitive_score <dbl>

The beauty of gather is that it can be combined with other the rest of the tidyverse using the pipe. Let’s feed out gathered result to ggplot.

Note: using scales = "free" when faceting will force ggplot to draw a separate axis for each subplot (by default, scales = "fixed", meaning that the same axes are drawn for each plot).

alldata %>%
  gather(cog_scale, cognitive_score, starts_with("cog")) %>%
  drop_na(risk_carrier, dx) %>%
  ggplot(aes(x = age, y = cognitive_score, color = risk_carrier)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    facet_wrap(~cog_scale, scales = "free")