Tidyverse

  • Family of packages developed to streamline graphing, data manipulation, data wrangling, and programming
  • Main packages of interest for us today: ggplot2, dplyr, tidyr (these have evolved to the point that they have replaced packages we covered in the R class: reshape2 and plyr)
  • These packages can work on a modern type of data frame called a tibble
  • ggplot2 is for graphing
  • dplyr is for data manipulation (think Excel pivot tables on steroids)
  • tidyr is for data wrangling (long to wide format and back again)
  • These packages support piping with %>% (from the magrittr package, re-exported by dplyr and tidyr), which makes it easier to string commands together
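To make the piping idea concrete, here is a minimal sketch on a made-up tibble (the name toy and its columns are illustrative assumptions, not part of the course data). It shows that a piped chain gives the same result as the equivalent nested call.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy <- tibble(id = 1:4, score = c(2, 4, 6, 8))

# Nested call: has to be read from the inside out
nested <- summarise(filter(toy, score > 2), mean_score = mean(score))

# Piped call: reads top to bottom, each result is passed on as the first argument
piped <- toy %>%
  filter(score > 2) %>%
  summarise(mean_score = mean(score))

identical(nested, piped)
```

The pipe only changes how the code reads; the functions being called are the same.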

Wrangling

  • tidyr makes data tidy
  • 2 workhorse functions
    • gather: Makes wide data long
    • spread: Makes long data wide
  • Supplemental functions
    • separate/unite: splits one column into several / merges several columns into one
    • extract: uses regular-expression groups within a column to make new columns
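A small sketch of the supplemental functions on a made-up data frame (toy and its cond column are assumptions for illustration). It splits a combined column, glues it back together, and then does the same split with a regular expression.

```r
library(tidyr)

# Hypothetical toy data: one column holding two pieces of information
toy <- data.frame(id = 1:3, cond = c("See_0", "NoSee_1", "See_2"))

# separate(): split one column into two at the underscore
split_up <- separate(toy, cond, into = c("Vision", "Time"),
                     sep = "_", convert = TRUE)

# unite(): glue the two columns back into one
glued <- unite(split_up, cond, Vision, Time, sep = "_")

# extract(): pull out regex capture groups (letters, then digits)
pulled <- extract(toy, cond, into = c("Vision", "Time"),
                  regex = "([A-Za-z]+)_([0-9]+)", convert = TRUE)

str(split_up)
str(pulled)
```

Here convert = TRUE is what turns the Time pieces back into integers instead of leaving them as text.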

Example Dataset

  • Longitudinal data of trust over 5 time points (game plays) with the within-subject variable of vision
  • Do you trust your partner in the game of “trouble” when you can or cannot see them?
  • Covariate: how “judging” a personality type they are (1-3 scale)
  • Data is in typical wide SPSS-like format
  • Download Data
TrustWide<-read.csv("Mixed/TrustWideData.csv")
Subject NoSee_0 NoSee_1 NoSee_2 NoSee_3 NoSee_4 See_0 See_1 See_2 See_3 See_4 Personality
1 0.1301 0.1504 0.1209 0.1216 0.1525 0.295 0.44 0.475 0.42 0.265 0.0222995
2 0.2051 0.2004 0.1959 0.2316 0.2575 0.460 0.59 0.660 0.66 0.530 0.6669353
3 0.0601 0.0604 0.1309 0.2316 0.3425 0.430 0.61 0.720 0.77 0.750 0.7457592
4 0.2101 0.2404 0.2209 0.2416 0.2425 0.435 0.57 0.625 0.59 0.425 0.8527015
5 0.1451 0.1304 0.1359 0.1716 0.1975 0.420 0.54 0.610 0.60 0.490 0.6066250
6 0.1101 0.1404 0.1909 0.2616 0.3625 0.440 0.57 0.640 0.67 0.650 2.1633908
  • This is not so useful for taking means and doing stats, so we will convert it to long format

Convert to Long

  • Process in words:
    • We will use piping %>% to pass the dataset along to each function (so you can skip the data frame that each function asks for as its first argument)
    • gather(new variable name, new value name, variables to merge)
library(tidyr)
library(dplyr)
TrustLong<-TrustWide %>% gather(Condition, TrustFeeling,NoSee_0:See_4)
str(TrustLong)
## 'data.frame':    120 obs. of  4 variables:
##  $ Subject     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Personality : num  0.0223 0.6669 0.7458 0.8527 0.6066 ...
##  $ Condition   : chr  "NoSee_0" "NoSee_0" "NoSee_0" "NoSee_0" ...
##  $ TrustFeeling: num  0.1301 0.2051 0.0601 0.2101 0.1451 ...
Subject Personality Condition TrustFeeling
1 0.0222995 NoSee_0 0.1301
2 0.6669353 NoSee_0 0.2051
3 0.7457592 NoSee_0 0.0601
4 0.8527015 NoSee_0 0.2101
5 0.6066250 NoSee_0 0.1451
6 2.1633908 NoSee_0 0.1101
  • Personality applies to the subject, so every row for a subject repeats that subject's Personality score
  • Our two variables (Vision & Time) are stuck together in one column. Good for SPSS RM ANOVA, but not helpful for us.
    • separate(variable to split, c(“names for the new columns”), what separates them, convert = TRUE to turn the pieces back into numeric/integers if they are numbers)
TrustLong.Final<-TrustLong %>% separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
Subject Personality Vision Time TrustFeeling
1 0.0222995 NoSee 0 0.1301
2 0.6669353 NoSee 0 0.2051
3 0.7457592 NoSee 0 0.0601
4 0.8527015 NoSee 0 0.2101
5 0.6066250 NoSee 0 0.1451
6 2.1633908 NoSee 0 0.1101
  • You can do this all at once!
TrustLong.Final<-
  TrustWide %>% 
  gather(Condition, TrustFeeling,NoSee_0:See_4) %>% 
  separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
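As a hedged aside: recent tidyr releases mark gather() as superseded by pivot_longer(), which can reshape and split the column names in one call. The sketch below uses a made-up two-subject slice shaped like TrustWide (the values are invented), so treat it as an illustration rather than the course's method.

```r
library(tidyr)

# Hypothetical two-subject slice shaped like TrustWide (values are made up)
wide <- data.frame(Subject = 1:2,
                   NoSee_0 = c(0.13, 0.21), NoSee_1 = c(0.15, 0.20),
                   See_0   = c(0.30, 0.46), See_1   = c(0.44, 0.59),
                   Personality = c(0.02, 0.67))

# One call does the work of gather() followed by separate()
long <- pivot_longer(wide,
                     cols = NoSee_0:See_1,
                     names_to = c("Vision", "Time"),   # two new columns
                     names_sep = "_",                  # split names here
                     values_to = "TrustFeeling",
                     names_transform = list(Time = as.integer))
long
```

One difference to expect: pivot_longer() orders rows subject by subject, whereas gather() stacks the data column by column.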

Convert back to wide

  • You just have to reverse the process. First, unite the variables into one column and then spread them out
TrustWide.Again<-
  TrustLong.Final %>% 
  unite(Condition,c("Vision","Time"),sep="_") %>% 
  spread(Condition, TrustFeeling)
Subject Personality NoSee_0 NoSee_1 NoSee_2 NoSee_3 NoSee_4 See_0 See_1 See_2 See_3 See_4
1 0.0222995 0.1301 0.1504 0.1209 0.1216 0.1525 0.295 0.44 0.475 0.42 0.265
2 0.6669353 0.2051 0.2004 0.1959 0.2316 0.2575 0.460 0.59 0.660 0.66 0.530
3 0.7457592 0.0601 0.0604 0.1309 0.2316 0.3425 0.430 0.61 0.720 0.77 0.750
4 0.8527015 0.2101 0.2404 0.2209 0.2416 0.2425 0.435 0.57 0.625 0.59 0.425
5 0.6066250 0.1451 0.1304 0.1359 0.1716 0.1975 0.420 0.54 0.610 0.60 0.490
6 2.1633908 0.1101 0.1404 0.1909 0.2616 0.3625 0.440 0.57 0.640 0.67 0.650
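The reverse trip can also be sketched with pivot_wider(), the newer counterpart to spread() in recent tidyr releases. The long data below is made up for illustration; names_from accepts both columns at once, so no separate unite() step is needed.

```r
library(tidyr)

# Hypothetical long data for two subjects (values are made up)
long <- data.frame(Subject = rep(1:2, each = 4),
                   Vision = rep(c("NoSee", "NoSee", "See", "See"), times = 2),
                   Time = rep(c(0L, 1L), times = 4),
                   TrustFeeling = c(0.13, 0.15, 0.30, 0.44,
                                    0.21, 0.20, 0.46, 0.59))

# Build the wide column names from Vision and Time, joined by "_"
wide <- pivot_wider(long,
                    names_from = c(Vision, Time),
                    values_from = TrustFeeling,
                    names_sep = "_")
wide
```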

Data Manipulation

  • dplyr lets you calculate means, SDs, or any other statistic you may want, based on how the data was wrangled.
  • 5 useful functions
    • group_by (how to cut the data)
    • filter (subset on the fly)
    • summarise (which descriptive statistics to compute)
    • do (run more complex stats) [with the help of the broom package, which converts regression output to tibbles]
    • mutate (add stats back to the data frame on the fly)
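Before applying these to the trust data, here is a minimal sketch of the core verbs on a made-up tibble (toy, group, and value are illustrative names only).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(group = c("a", "a", "b", "b", "b"),
              value = c(1, 3, 2, 4, 6))

# group_by + summarise: one row of descriptives per group
group_means <- toy %>%
  group_by(group) %>%
  summarise(mean_value = mean(value), n = n())

# group_by + mutate: keep every row, add a group-centered score
centered <- toy %>%
  group_by(group) %>%
  mutate(centered = value - mean(value)) %>%
  ungroup()

# filter: subset rows on the fly
only_b <- toy %>% filter(group == "b")

group_means
```

The key contrast is that summarise collapses each group to one row, while mutate keeps the original rows and adds a column.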

Group and summarise

  • We can calculate means and sd per group
Means<-TrustLong.Final %>%
  group_by(Vision,Time) %>%
  summarise(MeanTrust=mean(TrustFeeling),
            SDTrust=sd(TrustFeeling))
Vision Time MeanTrust SDTrust
NoSee 0 0.1451000 0.0596200
NoSee 1 0.1512333 0.0751312
NoSee 2 0.1717333 0.0662582
NoSee 3 0.2091000 0.0988226
NoSee 4 0.2591667 0.1212685
See 0 0.4141667 0.0504450
See 1 0.5525000 0.0525919
See 2 0.6191667 0.0801088
See 3 0.6183333 0.1228327
See 4 0.5150000 0.1749026
  • We can also calculate a correlation per subject, separately per vision condition
library(broom)
CorrResult<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(cor.test(.$Time, .$TrustFeeling)))
Vision Subject estimate statistic p.value parameter conf.low conf.high method alternative
NoSee 1 0.1645258 0.2889041 0.7914681 3 -0.8396155 0.9141048 Pearson’s product-moment correlation two.sided
NoSee 2 0.8261808 2.5398903 0.0846876 3 -0.2068905 0.9881634 Pearson’s product-moment correlation two.sided
NoSee 3 0.9577867 5.7706177 0.0103452 3 0.4873004 0.9973063 Pearson’s product-moment correlation two.sided
NoSee 4 0.7068877 1.7309782 0.1818874 3 -0.4660154 0.9787461 Pearson’s product-moment correlation two.sided
NoSee 5 0.8234354 2.5135828 0.0866641 3 -0.2150959 0.9879596 Pearson’s product-moment correlation two.sided
NoSee 6 0.9769363 7.9243892 0.0041901 3 0.6856064 0.9985416 Pearson’s product-moment correlation two.sided
  • Or we can run a regression per subject, separately per vision condition
RegresResult<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(lm(TrustFeeling ~ Time, data=.)))
Vision Subject term estimate std.error statistic p.value
NoSee 1 (Intercept) 0.1319 0.0135657 9.7230572 0.0023108
NoSee 1 Time 0.0016 0.0055382 0.2889041 0.7914681
NoSee 2 (Intercept) 0.1909 0.0131159 14.5548039 0.0007033
NoSee 2 Time 0.0136 0.0053546 2.5398903 0.0846876
NoSee 3 (Intercept) 0.0179 0.0312414 0.5729568 0.6068013
NoSee 3 Time 0.0736 0.0127543 5.7706177 0.0103452
  • You can mutate the dataset you just created to add a new column computed from an existing column
RegresResult <-
  RegresResult %>%
  mutate(Sig = if_else(p.value < .05,1,0))
Vision Subject term estimate std.error statistic p.value Sig
NoSee 1 (Intercept) 0.1319 0.0135657 9.7230572 0.0023108 1
NoSee 1 Time 0.0016 0.0055382 0.2889041 0.7914681 0
NoSee 2 (Intercept) 0.1909 0.0131159 14.5548039 0.0007033 1
NoSee 2 Time 0.0136 0.0053546 2.5398903 0.0846876 0
NoSee 3 (Intercept) 0.0179 0.0312414 0.5729568 0.6068013 0
NoSee 3 Time 0.0736 0.0127543 5.7706177 0.0103452 1
  • Now you can take that new dataset and summarize it (mean slope estimates for the Time term only, split by significant vs. nonsignificant results) [Note: dplyr::filter is used because filter conflicts with a function of the same name in another package. The prefix forces filter to come from the dplyr package.]
RSum<-RegresResult  %>%
  group_by(Vision,Sig) %>%
  dplyr::filter(term=="Time") %>%
  summarise(N=length(Sig),
            MeanSlope=mean(estimate))
Vision Sig N MeanSlope
NoSee 0 8 0.0099750
NoSee 1 4 0.0658500
See 0 9 0.0071111
See 1 3 0.0856667
  • Note: You can do this all at once, because the steps are carried out in the order you call them
RegresFinal<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(lm(TrustFeeling ~ Time, data=.))) %>%
  mutate(Sig = if_else(p.value < .05,1,0)) %>%
  group_by(Vision,Sig) %>%
  dplyr::filter(term=="Time") %>%
  summarise(N=length(Sig),
            MeanSlope=mean(estimate))
  • You can pipe all these results into a ggplot
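A hedged alternative to the per-subject regression pipeline above: recent dplyr documentation marks do() as superseded, and group_modify() is one replacement that keeps the same group-then-tidy pattern. The sketch below uses a made-up toy tibble (two subjects, three time points), not the trust data.

```r
library(dplyr)
library(broom)

# Hypothetical toy data: two subjects measured at three time points
toy <- tibble(Subject = rep(1:2, each = 3),
              Time    = rep(0:2, times = 2),
              Trust   = c(0.1, 0.3, 0.2, 0.5, 0.3, 0.4))

# Fit one regression per subject; .x is the data for the current group
slopes <- toy %>%
  group_by(Subject) %>%
  group_modify(~ tidy(lm(Trust ~ Time, data = .x))) %>%
  ungroup() %>%
  filter(term == "Time")

slopes
```

The result has one Time row per subject, in the same tidy layout that do(tidy(...)) produces, so the later mutate and summarise steps would work unchanged.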

ggplot

Inspired by “The Grammar of Graphics” (Leland Wilkinson, 1999)

“Destined to become a landmark in statistical graphics, this book provides a formal description of graphics, particularly static graphics, playing much the same role for graphics as probability theory played for statistics.” (Journal of the American Statistical Association)

Leland Wilkinson: Former VP at SPSS Inc. Founder of SYSTAT. Adjunct Professor of Statistics at Northwestern University. He is also affiliated with the Computer Science department at The University of Illinois at Chicago.

Grammar

  • Data (data = ): data frame to be mapped
  • Aesthetic mapping (aes( )): x = , y = , group = from your data. You can also add additional mappings inside aes:
    • (color = , shape = , size = , fill = , alpha = , etc.): These can be fixed values or variables
      • Note: if you place these calls into aes(x=IV, y=DV, color=subjectID), then the subjects will appear in your legend with those colors. You can move some of these mappings into the geom call to avoid that
  • Geometric object (geom_): bars, points, lines, ribbons; the shapes you want to graph based on your aes
    • Position adjustments (position = ): goes with the geom_ call, such as position_dodge (don’t overlap), position_identity (leave as read in), position_jitter (jitters data points in a scatterplot)
  • Statistical transformations (stat_): On-the-fly transforms (such as averaging); can be used instead of geoms (Note: I prefer to calculate stats outside of the plot when possible, as it's easier to see what you are doing)
  • Coordinate system (coord_): do you want to add coord_cartesian(), coord_polar(), etc
  • Scales (scale_) or simply (xlim, ylim). Override defaults to control many aspects of the graphs
  • Faceting (facet_): visual subsets: two options grid or wrap
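To tie the grammar terms to actual calls, here is a minimal sketch on made-up data (toy, x, y, and g are illustrative names). Each line is commented with the grammar component it supplies.

```r
library(ggplot2)

# Hypothetical toy data: two groups measured at five x values
toy <- data.frame(x = rep(1:5, times = 2),
                  y = c(1, 2, 3, 4, 5, 2, 2, 3, 3, 4),
                  g = rep(c("a", "b"), each = 5))

p <- ggplot(data = toy, aes(x = x, y = y, color = g)) +              # data + aesthetic mapping
  geom_point(position = position_jitter(width = 0.1, height = 0)) +  # geom + position adjustment
  stat_summary(fun = mean, geom = "line") +                          # statistical transformation
  scale_color_manual(values = c(a = "black", b = "red")) +           # scale
  coord_cartesian(ylim = c(0, 6)) +                                  # coordinate system
  facet_wrap(~g)                                                     # faceting
p
```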

Layering

  • First, you put on your jacket; then you put on your shoes, next underwear, finally you shower, right?
  • ggplot has a very specific order that you should generally follow.
    • what is my data, what are my mappings, what are my geoms, how should I position them, and last, what do I want to do with the look
  • The order in which you add calls is the order in which they are drawn, so later layers sit on top of earlier ones and later settings override earlier ones

Walkthrough

  • Step 1: ggplot and aes
library(ggplot2)
G1<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))
G1

  • Step 2: add geom
  • This can help you look up all the types
help.search("geom_", package = "ggplot2")
  • let’s start with points
G2<-G1+geom_point()
G2

  • Step 3: Best fit line
G3<-G2+geom_smooth()
G3

  • This line is loess; let’s make it lm with a second-order polynomial, \(y = x + x^2\), which we can write as poly(x, 2) [orthogonal polynomials]. [Note: you will not always need to call stats::poly; normally you can write just poly, but here it conflicts with a function from another package]
?geom_smooth
G3a<-G2+geom_smooth(method='lm', formula = y ~ stats::poly(x,2))
G3a
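A quick check of what "orthogonal" buys you, sketched on made-up numbers (x and y below are invented, not taken from the trust data). The orthogonal form poly(x, 2) and the raw form \(y = b_0 + b_1 x + b_2 x^2\) describe the same curve, so the fitted line in the plot is identical; only the coefficients are parameterized differently.

```r
# Hypothetical data: five time points with a curved trend
x <- 0:4
y <- c(0.30, 0.44, 0.48, 0.42, 0.27)

fit_orth <- lm(y ~ poly(x, 2))     # orthogonal polynomial terms
fit_raw  <- lm(y ~ x + I(x^2))     # raw x and x^2 terms

# Same fitted curve either way
all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))

# The orthogonal columns are uncorrelated with each other,
# whereas raw x and x^2 are strongly correlated
cor(poly(x, 2))
cor(x, x^2)
```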

  • Step 4: facet grid by Vision condition
G4a<-G3a+facet_grid(~Vision)
G4a

  • Step 5: Fancy the plot up

  • Add a theme

G4b<-G4a+theme_bw()
G4b

  • Modify the theme to change the minor grid lines
G4c<-G4b+theme(panel.grid.minor = element_blank())
G4c

  • Change the x and y labels
G4d<-G4c+xlab("Time Step")+ylab("Trust Score")
G4d

  • Try to override the geom_smooth with a different color line and set se to FALSE
G4e<-G4d+geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')
G4e

  • Wait, why do I still see SE ribbons? Because the earlier geom_smooth layer is still there! Adding a geom adds a new layer; it does not replace the old one.
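You can confirm this by counting layers. The sketch below uses made-up data and hypothetical object names (base, with_ribbon, both, no_ribbon); the point is only that each geom call appends to the plot's layer list.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x = rep(0:4, times = 3),
                  y = c(1, 2, 3, 4, 5, 2, 3, 3, 5, 5, 1, 3, 4, 4, 6) / 10)

base        <- ggplot(toy, aes(x, y)) + geom_point()
with_ribbon <- base + geom_smooth(method = "lm", formula = y ~ x)
both        <- with_ribbon +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red")

length(with_ribbon$layers)  # points + first smooth
length(both$layers)         # points + both smooths, ribbon still drawn

# To lose the ribbon, go back to the plot from before the first smooth was added
no_ribbon <- base +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red")
length(no_ribbon$layers)
```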

  • Let’s make the graph all at once

Poly.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())
Poly.Plot

Spaghetti Plot

  • We can add a best-fit line per subject; we just need to make the subject a grouping variable
    • aes(group=Subject)
      • This can be added at geom_smooth if you want it to apply only to that geom, as below
Speg.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(aes(group=Subject),method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())+
  ggtitle("Trust Study")
Speg.Plot

  • Or you can apply it in the overarching aes
Speg.Plot.2<-ggplot(data = TrustLong.Final, 
                    aes(x = Time , y = TrustFeeling,group=Subject))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())+
  ggtitle("Trust Study")
Speg.Plot.2

  • Now that the grouping is in the overarching aes, you can do other fun stuff, like coloring each data point and line by subject
    • Notice below where I added: aes(color=Subject)
Speg.Plot.3<-ggplot(data = TrustLong.Final, 
                    aes(x = Time, y = TrustFeeling, group=Subject, color=Subject))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())
Speg.Plot.3

  • Wait, why is the color continuous?
str(TrustLong.Final)
## 'data.frame':    120 obs. of  5 variables:
##  $ Subject     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Personality : num  0.0223 0.6669 0.7458 0.8527 0.6066 ...
##  $ Vision      : chr  "NoSee" "NoSee" "NoSee" "NoSee" ...
##  $ Time        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrustFeeling: num  0.1301 0.2051 0.0601 0.2101 0.1451 ...
  • Because Subject is an integer
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
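A minimal sketch of the same fix on made-up data (toy is hypothetical): an integer mapped to color gets a continuous gradient, while a factor gets one distinct color per level. Wrapping the variable in factor() inside aes is an alternative if you would rather not add a column.

```r
library(ggplot2)

# Hypothetical toy data: three subjects, two time points each
toy <- data.frame(Subject = rep(1:3, each = 2),
                  Time    = rep(0:1, times = 3),
                  Trust   = c(0.1, 0.2, 0.3, 0.5, 0.2, 0.6))
toy$Subject.F <- as.factor(toy$Subject)

# Integer: ggplot treats Subject as a number and uses a color gradient
p_cont <- ggplot(toy, aes(Time, Trust, color = Subject)) + geom_point()

# Factor: one discrete color per subject
p_disc <- ggplot(toy, aes(Time, Trust, color = Subject.F)) + geom_point()

# Same result without creating a new column
p_inline <- ggplot(toy, aes(Time, Trust, color = factor(Subject))) + geom_point()

class(toy$Subject)
class(toy$Subject.F)
```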
  • Remake the plot and replace Subject with Subject.F
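  • We can also map shape and linetype to Subject (in case we have to print in black and white). Note: ggplot2's default shape palette only covers 6 levels, so with 12 subjects it will warn and leave some points undrawn unless you supply the shapes yourself, e.g. scale_shape_manual(values = 1:12)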
Speg.Plot.4<-ggplot(data = TrustLong.Final, 
                    aes(x = Time, y = TrustFeeling, group=Subject.F, color=Subject.F,shape=Subject.F,linetype=Subject.F))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())
Speg.Plot.4

  • Remove the legend
Speg.Plot.4a<-Speg.Plot.4+theme(legend.position = "none")
Speg.Plot.4a

  • And you can make it totally “far out”
    • You can use HTML (hex) color codes or color names
Speg.Plot.4a + theme(plot.background = element_rect(size = 1, color = "blue", fill = "purple"),
        text=element_text(size = 12, family = "serif", color = "ivory"),
        axis.text.y = element_text(colour = "magenta"),
        axis.text.x = element_text(colour = "green"),
        panel.background = element_rect(fill = "pink"),
        strip.background = element_rect(fill = "#ccff66"))

Tidyr & ggplot fun time

  • Just for fun, let’s run the polynomial regressions and extract the \(R^2\) per subject for each Vision condition
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
PolyRegress<-TrustLong.Final %>%
  group_by(Vision, Subject.F) %>%
  do(glance(lm(TrustFeeling ~ stats::poly(Time,2), data=.)))  %>%
  select(Subject.F, Vision, r.squared)
Subject.F Vision r.squared
1 NoSee 0.2266071
2 NoSee 0.9536065
3 NoSee 0.9938067
4 NoSee 0.5280375
5 NoSee 0.9600112
6 NoSee 0.9997217
  • Let’s extract the Personality score and join it to that new dataset
PersonalityScore<-TrustLong.Final %>% 
  dplyr::filter(Time==0) %>% 
  select(Subject.F,Personality, Vision)

MergedData<-left_join(PolyRegress,PersonalityScore, by = c("Subject.F","Vision"))
Subject.F Vision r.squared Personality
1 NoSee 0.2266071 0.0222995
2 NoSee 0.9536065 0.6669353
3 NoSee 0.9938067 0.7457592
4 NoSee 0.5280375 0.8527015
5 NoSee 0.9600112 0.6066250
6 NoSee 0.9997217 2.1633908
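How the join lines rows up can be seen in a small sketch with made-up tables (fits and traits are hypothetical names). Here the trait table has one row per subject, so joining on Subject.F alone repeats each score across that subject's Vision rows.

```r
library(dplyr)

# Hypothetical per-subject, per-condition results
fits <- tibble(Subject.F = factor(c(1, 1, 2, 2)),
               Vision    = c("NoSee", "See", "NoSee", "See"),
               r.squared = c(0.2, 0.9, 0.5, 0.8))

# Hypothetical per-subject trait table (one row per subject)
traits <- tibble(Subject.F   = factor(c(1, 2)),
                 Personality = c(0.02, 0.67))

# left_join keeps every row of fits and looks up the matching trait row
merged <- left_join(fits, traits, by = "Subject.F")
merged
```

Naming the key columns in by is optional, but it avoids surprises when the two tables happen to share other column names.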
  • Scatter plot by Vision with fancy labels
Scatter.plot<-ggplot(data = MergedData, aes(Personality,r.squared))+
  geom_point(aes(shape=Vision))+
  xlab("Personality Score")+
  ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
  theme_minimal()
Scatter.plot

  • Reorder and rename labels
MergedData$Vision.O <- factor(MergedData$Vision,
                     levels = c("See","NoSee"),
                     labels = c("Partner is Visible","Partner is Obscured"))
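What levels and labels each do is easier to see on a tiny made-up vector (vision and vision_f are illustrative names): levels sets the order of the categories, and labels renames them in that same order.

```r
# Hypothetical raw values
vision <- c("NoSee", "See", "See", "NoSee")

vision_f <- factor(vision,
                   levels = c("See", "NoSee"),   # See comes first in legends
                   labels = c("Partner is Visible", "Partner is Obscured"))

levels(vision_f)       # new names, in the chosen order
as.character(vision_f) # each value relabeled
as.integer(vision_f)   # underlying codes follow the level order
```

ggplot uses the level order for legend entries and for matching values in scale_shape_manual, which is why reordering here changes which shape each condition gets.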
  • Add new labels and fix up legend
Scatter.plot2<-ggplot(data = MergedData, aes(Personality,r.squared))+
  geom_point(aes(shape=Vision.O),size = 2.5, stroke = 1.25)+
  xlab("Personality Score")+
  ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
  theme_minimal()+theme(legend.position = "top",
                   legend.text = element_text(size = 11, color = "gray50"),
                   legend.title=element_blank())
Scatter.plot2

Scatter.plot3<-Scatter.plot2+scale_shape_manual(values=c(21,24))
Scatter.plot3

  • Let’s add labels so we know which dots are which subjects
library("ggrepel")
Scatter.plot3 +
  geom_text_repel(aes(label=Subject.F), size = 3)

  • Label only a handful of subjects
pointsToLabel <- c("1","6","12")
Scatter.plot4<-Scatter.plot3 +
    geom_text_repel(aes(label = Subject.F),
                    color = "gray20",
                    data = subset(MergedData, Subject.F %in% pointsToLabel),
                    force = 10)
Scatter.plot4

  • Fix up X axis
Scatter.plot5<- Scatter.plot4+scale_x_continuous(name = "Personality Score, Judging (1=Most)",
                       limits = c(0.0, 2.5),
                       breaks = seq(0.0, 2.5, by = 0.25)) 
Scatter.plot5

  • Add a more complex regression to the plot
Scatter.plot6 <-Scatter.plot5 +  
  geom_smooth(aes(linetype = Vision.O, group=Vision.O),
              method = "lm",
              formula = y ~ log(x), se = FALSE,
              color = "green")
Scatter.plot6

