Wrangling
- tidyr makes data tidy
- 2 workhorse functions
- gather: Makes wide data long
- spread: Makes long data wide
- 2 supplemental functions
- separate/unite: breaks/merges columns into multiple/one column
- extract: takes groups within a column to make new columns
Example Dataset
- Longitudinal data of trust over 5 time points (game plays) with the
within-subject variable of vision
- Do you trust your partner in the game of “trouble” when you can or
cannot see them
- Covariate (how “judging” personality type they are 1-3 scale)
- Data is in typical wide SPSS-like format
- Download Data
TrustWide<-read.csv("Mixed/TrustWideData.csv")
| Subject | NoSee_0 | NoSee_1 | NoSee_2 | NoSee_3 | NoSee_4 | See_0 | See_1 | See_2 | See_3 | See_4 | Personality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1301 | 0.1504 | 0.1209 | 0.1216 | 0.1525 | 0.295 | 0.44 | 0.475 | 0.42 | 0.265 | 0.0222995 |
| 2 | 0.2051 | 0.2004 | 0.1959 | 0.2316 | 0.2575 | 0.460 | 0.59 | 0.660 | 0.66 | 0.530 | 0.6669353 |
| 3 | 0.0601 | 0.0604 | 0.1309 | 0.2316 | 0.3425 | 0.430 | 0.61 | 0.720 | 0.77 | 0.750 | 0.7457592 |
| 4 | 0.2101 | 0.2404 | 0.2209 | 0.2416 | 0.2425 | 0.435 | 0.57 | 0.625 | 0.59 | 0.425 | 0.8527015 |
| 5 | 0.1451 | 0.1304 | 0.1359 | 0.1716 | 0.1975 | 0.420 | 0.54 | 0.610 | 0.60 | 0.490 | 0.6066250 |
| 6 | 0.1101 | 0.1404 | 0.1909 | 0.2616 | 0.3625 | 0.440 | 0.57 | 0.640 | 0.67 | 0.650 | 2.1633908 |
- This is not so useful for taking means and doing stats, so we will
convert it to long format
Convert to Long
- Process in words:
- We will use piping %>% to help us pass the
dataset along to each function (so you ignore when the function asks for
dataframe first)
- gather(new variable name, new value name, variables to merge)
library(tidyr)
library(dplyr)
TrustLong<-TrustWide %>% gather(Condition, TrustFeeling,NoSee_0:See_4)
str(TrustLong)
## 'data.frame': 120 obs. of 4 variables:
## $ Subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Personality : num 0.0223 0.6669 0.7458 0.8527 0.6066 ...
## $ Condition : chr "NoSee_0" "NoSee_0" "NoSee_0" "NoSee_0" ...
## $ TrustFeeling: num 0.1301 0.2051 0.0601 0.2101 0.1451 ...
| Subject | Personality | Condition | TrustFeeling |
|---|---|---|---|
| 1 | 0.0222995 | NoSee_0 | 0.1301 |
| 2 | 0.6669353 | NoSee_0 | 0.2051 |
| 3 | 0.7457592 | NoSee_0 | 0.0601 |
| 4 | 0.8527015 | NoSee_0 | 0.2101 |
| 5 | 0.6066250 | NoSee_0 | 0.1451 |
| 6 | 2.1633908 | NoSee_0 | 0.1101 |
- Personality applies to the subject, so every time we see the subject
names, we need their Personality score
- Our two variables (Vision & Time) are all stuck together. Good
for SPSS RM anova, but not helpful for us.
- separate(variable to split, c(“what do we call them”), what
separates them, convert them back to numeric/integers if they are
numbers)
TrustLong.Final<-TrustLong %>% separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
| Subject | Personality | Vision | Time | TrustFeeling |
|---|---|---|---|---|
| 1 | 0.0222995 | NoSee | 0 | 0.1301 |
| 2 | 0.6669353 | NoSee | 0 | 0.2051 |
| 3 | 0.7457592 | NoSee | 0 | 0.0601 |
| 4 | 0.8527015 | NoSee | 0 | 0.2101 |
| 5 | 0.6066250 | NoSee | 0 | 0.1451 |
| 6 | 2.1633908 | NoSee | 0 | 0.1101 |
- You can do this all at once!
TrustLong.Final<-
TrustWide %>%
gather(Condition, TrustFeeling,NoSee_0:See_4) %>%
separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
Convert back to wide
- You just have to reverse the process. First, unite the variables
into one column and then spread them out
TrustWide.Again<-
TrustLong.Final %>%
unite(Condition,c("Vision","Time"),sep="_") %>%
spread(Condition, TrustFeeling)
| Subject | Personality | NoSee_0 | NoSee_1 | NoSee_2 | NoSee_3 | NoSee_4 | See_0 | See_1 | See_2 | See_3 | See_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0222995 | 0.1301 | 0.1504 | 0.1209 | 0.1216 | 0.1525 | 0.295 | 0.44 | 0.475 | 0.42 | 0.265 |
| 2 | 0.6669353 | 0.2051 | 0.2004 | 0.1959 | 0.2316 | 0.2575 | 0.460 | 0.59 | 0.660 | 0.66 | 0.530 |
| 3 | 0.7457592 | 0.0601 | 0.0604 | 0.1309 | 0.2316 | 0.3425 | 0.430 | 0.61 | 0.720 | 0.77 | 0.750 |
| 4 | 0.8527015 | 0.2101 | 0.2404 | 0.2209 | 0.2416 | 0.2425 | 0.435 | 0.57 | 0.625 | 0.59 | 0.425 |
| 5 | 0.6066250 | 0.1451 | 0.1304 | 0.1359 | 0.1716 | 0.1975 | 0.420 | 0.54 | 0.610 | 0.60 | 0.490 |
| 6 | 2.1633908 | 0.1101 | 0.1404 | 0.1909 | 0.2616 | 0.3625 | 0.440 | 0.57 | 0.640 | 0.67 | 0.650 |
Data Manipulation
- dplyr lets you calculate means, sd or any other statistic you may
want on the data based on how the data was wrangled.
- 5 useful functions
- group_by (how to cut data)
- filter (subset on the fly)
- summarise (what descriptive to conduct)
- do (do more complex stats) [with the help of broom package which
converts regressions to tibbles]
- mutate (add stats back to data frame on the fly)
Group and summarise
- We can calculate means and sd per group
Means<-TrustLong.Final %>%
group_by(Vision,Time) %>%
summarise(MeanTrust=mean(TrustFeeling),
SDTrust=sd(TrustFeeling))
| Vision | Time | MeanTrust | SDTrust |
|---|---|---|---|
| NoSee | 0 | 0.1451000 | 0.0596200 |
| NoSee | 1 | 0.1512333 | 0.0751312 |
| NoSee | 2 | 0.1717333 | 0.0662582 |
| NoSee | 3 | 0.2091000 | 0.0988226 |
| NoSee | 4 | 0.2591667 | 0.1212685 |
| See | 0 | 0.4141667 | 0.0504450 |
| See | 1 | 0.5525000 | 0.0525919 |
| See | 2 | 0.6191667 | 0.0801088 |
| See | 3 | 0.6183333 | 0.1228327 |
| See | 4 | 0.5150000 | 0.1749026 |
- we can also calculate a correlation per subject separately per
vision condition
library(broom)
CorrResult<-TrustLong.Final %>%
group_by(Vision,Subject) %>%
do(tidy(cor.test(.$Time, .$TrustFeeling)))
| Vision | Subject | estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| NoSee | 1 | 0.1645258 | 0.2889041 | 0.7914681 | 3 | -0.8396155 | 0.9141048 | Pearson’s product-moment correlation | two.sided |
| NoSee | 2 | 0.8261808 | 2.5398903 | 0.0846876 | 3 | -0.2068905 | 0.9881634 | Pearson’s product-moment correlation | two.sided |
| NoSee | 3 | 0.9577867 | 5.7706177 | 0.0103452 | 3 | 0.4873004 | 0.9973063 | Pearson’s product-moment correlation | two.sided |
| NoSee | 4 | 0.7068877 | 1.7309782 | 0.1818874 | 3 | -0.4660154 | 0.9787461 | Pearson’s product-moment correlation | two.sided |
| NoSee | 5 | 0.8234354 | 2.5135828 | 0.0866641 | 3 | -0.2150959 | 0.9879596 | Pearson’s product-moment correlation | two.sided |
| NoSee | 6 | 0.9769363 | 7.9243892 | 0.0041901 | 3 | 0.6856064 | 0.9985416 | Pearson’s product-moment correlation | two.sided |
- or we can also run regression per subject separately per vision
condition
RegresResult<-TrustLong.Final %>%
group_by(Vision,Subject) %>%
do(tidy(lm(TrustFeeling ~ Time, data=.)))
| Vision | Subject | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|---|
| NoSee | 1 | (Intercept) | 0.1319 | 0.0135657 | 9.7230572 | 0.0023108 |
| NoSee | 1 | Time | 0.0016 | 0.0055382 | 0.2889041 | 0.7914681 |
| NoSee | 2 | (Intercept) | 0.1909 | 0.0131159 | 14.5548039 | 0.0007033 |
| NoSee | 2 | Time | 0.0136 | 0.0053546 | 2.5398903 | 0.0846876 |
| NoSee | 3 | (Intercept) | 0.0179 | 0.0312414 | 0.5729568 | 0.6068013 |
| NoSee | 3 | Time | 0.0736 | 0.0127543 | 5.7706177 | 0.0103452 |
- You mutate the dataset you just created and add a column created by
another column
RegresResult <-
RegresResult %>%
mutate(Sig = if_else(p.value < .05,1,0))
| Vision | Subject | term | estimate | std.error | statistic | p.value | Sig |
|---|---|---|---|---|---|---|---|
| NoSee | 1 | (Intercept) | 0.1319 | 0.0135657 | 9.7230572 | 0.0023108 | 1 |
| NoSee | 1 | Time | 0.0016 | 0.0055382 | 0.2889041 | 0.7914681 | 0 |
| NoSee | 2 | (Intercept) | 0.1909 | 0.0131159 | 14.5548039 | 0.0007033 | 1 |
| NoSee | 2 | Time | 0.0136 | 0.0053546 | 2.5398903 | 0.0846876 | 0 |
| NoSee | 3 | (Intercept) | 0.0179 | 0.0312414 | 0.5729568 | 0.6068013 | 0 |
| NoSee | 3 | Time | 0.0736 | 0.0127543 | 5.7706177 | 0.0103452 | 1 |
- Now you can take that new dataset and summarize it (mean slope
estimates just for Time by significant vs nonsignificant results)
[Note: dplyr::filter is used because filter conflicts with a function of
the same name in another package. Writing dplyr::filter forces the
function to come from the dplyr package.]
RSum<-RegresResult %>%
group_by(Vision,Sig) %>%
dplyr::filter(term=="Time") %>%
summarise(N=length(Sig),
MeanSlope=mean(estimate))
| Vision | Sig | N | MeanSlope |
|---|---|---|---|
| NoSee | 0 | 8 | 0.0099750 |
| NoSee | 1 | 4 | 0.0658500 |
| See | 0 | 9 | 0.0071111 |
| See | 1 | 3 | 0.0856667 |
- Note: You can do this all at once because each step is conducted in
the order you call it
RegresFinal<-TrustLong.Final %>%
group_by(Vision,Subject) %>%
do(tidy(lm(TrustFeeling ~ Time, data=.))) %>%
mutate(Sig = if_else(p.value < .05,1,0)) %>%
group_by(Vision,Sig) %>%
dplyr::filter(term=="Time") %>%
summarise(N=length(Sig),
MeanSlope=mean(estimate))
- You can pipe all these results into a ggplot
ggplot
Inspired by “The Grammar of Graphics” by Leland Wilkinson (1999)
“Destined to become a landmark in statistical graphics, this book
provides a formal description of graphics, particularly static graphics,
playing much the same role for graphics as probability theory played for
statistics.” (Journal of the American Statistical Association)
Leland Wilkinson: former VP at SPSS Inc., founder of SYSTAT, and Adjunct
Professor of Statistics at Northwestern University. He is also affiliated
with the Computer Science department at The University of Illinois at Chicago.
Grammar
- Data (data = ): Data.Frame to be
mapped
- Aesthetic mapping (aes( )): x = , y = ,
group = from your data. Can also add additional mapping into
the aes:
- (color = , shape = , size = , fill =
, alpha = , etc): These can be fixed values or variables
- Note: if you place these mappings into aes(x=IV, y=DV, color=subjectID),
then the subject will appear in your legend with those colors. You can
move some of these mappings into the geom call to avoid that
- Geometric object (geom_): bar, point,
line, ribbons, shapes you want to graph based on your
aes
- Position adjustments (position = ): goes with
geom_ call such as position_dodge (don’t overlap),
position_identity (leave as read in), position_jitter
(jitters data points in scatterplot)
- Statistical transformations (stat_): On the fly transforms
(such as averaging): can be used instead of geoms (Note: I
prefer to calculate stats outside of the plot when possible as it’s
easier to see what you are doing)
- Coordinate system (coord_): do you want to add
coord_cartesian(), coord_polar(), etc
- Scales (scale_) or simply (xlim, ylim). Override defaults
to control many aspects of the graphs
- Faceting (facet_): visual subsets: two options
grid or wrap
Layering
- First, you put on your jacket; then you put on your shoes, next
underwear, finally you shower, right?
- ggplot has very specific order that you should generally follow.
- what is my data, what are my mappings, what are geoms, how should I
position them, and last what do I want to do with the look
- The order in which you add calls is the order in which they are drawn,
so later layers sit on top of earlier ones, and later settings (labels,
themes, scales) override earlier ones
Walkthrough
library(ggplot2)
G1<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))
G1
- Step 2: add geom
- This can help you look up all the types
help.search("geom_", package = "ggplot2")
G2<-G1+geom_point()
G2
G3<-G2+geom_smooth()
G3
- This line is loess, let’s make it lm with a second-order polynomial:
\(y = x+x^2\), which we can write as poly(x,2) [orthogonal polynomials].
[Note: you will not always need to call stats::poly; normally you can
write just poly, but here it conflicts with a function in another package]
?geom_smooth
G3a<-G2+geom_smooth(method='lm', formula = y ~ stats::poly(x,2))
G3a
- Step 4: facet grid by Vision condition
G4a<-G3a+facet_grid(~Vision)
G4a
G4b<-G4a+theme_bw()
G4b
- Modify the theme to change the minor grid lines
G4c<-G4b+theme(panel.grid.minor = element_blank())
G4c
- Change the x and y labels
G4d<-G4c+xlab("Time Step")+ylab("Trust Score")
G4d
- Override the geom_smooth with a different color line and set SE to
false
G4e<-G4d+geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')
G4e
Poly.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())
Poly.Plot
Spaghetti Plot
- We can add best fit line per subject, we just need to make the
subject a grouping variable
- aes(group=Subject)
- This can be added either at geom_smooth if you just want it to apply
to that geom such as below
Speg.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(aes(group=Subject),method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())+
ggtitle("Trust Study")
Speg.Plot
- or you can apply it the over-arching aes
Speg.Plot.2<-ggplot(data = TrustLong.Final,
aes(x = Time , y = TrustFeeling,group=Subject))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())+
ggtitle("Trust Study")
Speg.Plot.2
- now that the grouping is in the over-arching aes, you can do other fun
stuff, like color each data point and line by subject
- Notice below where I added: aes(color=Subject)
Speg.Plot.3<-ggplot(data = TrustLong.Final,
aes(x = Time, y = TrustFeeling, group=Subject, color=Subject))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())
Speg.Plot.3
- Wait, why is the color continuous?
str(TrustLong.Final)
## 'data.frame': 120 obs. of 5 variables:
## $ Subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Personality : num 0.0223 0.6669 0.7458 0.8527 0.6066 ...
## $ Vision : chr "NoSee" "NoSee" "NoSee" "NoSee" ...
## $ Time : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TrustFeeling: num 0.1301 0.2051 0.0601 0.2101 0.1451 ...
- Because Subject is an integer
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
- Remake the plot and replace Subject with Subject.F
- also we can add Shape and linetype by Subject (in case we have to
print in black and white)
Speg.Plot.4<-ggplot(data = TrustLong.Final,
aes(x = Time, y = TrustFeeling, group=Subject.F, color=Subject.F,shape=Subject.F,linetype=Subject.F))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())
Speg.Plot.4
Speg.Plot.4a<-Speg.Plot.4+theme(legend.position = "none")
Speg.Plot.4a
- And you can make it totally “far out”
- You can use HTML hex codes or color names
Speg.Plot.4a + theme(plot.background = element_rect(size = 1, color = "blue", fill = "purple"),
text=element_text(size = 12, family = "Serif", color = "ivory"),
axis.text.y = element_text(colour = "magenta"),
axis.text.x = element_text(colour = "green"),
panel.background = element_rect(fill = "pink"),
strip.background = element_rect(fill = "#ccff66"))
Tidyr & ggplot fun time
- Just for fun, let’s run the polynomial regressions and extract the
\(R^2\) per subject within each Vision condition
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
PolyRegress<-TrustLong.Final %>%
group_by(Vision, Subject.F) %>%
do(glance(lm(TrustFeeling ~ stats::poly(Time,2), data=.))) %>%
select(Vision,r.squared)
| Subject.F | Vision | r.squared |
|---|---|---|
| 1 | NoSee | 0.2266071 |
| 2 | NoSee | 0.9536065 |
| 3 | NoSee | 0.9938067 |
| 4 | NoSee | 0.5280375 |
| 5 | NoSee | 0.9600112 |
| 6 | NoSee | 0.9997217 |
- Let’s extract the Personality score and join it to that new
dataset
PersonalityScore<-TrustLong.Final %>%
dplyr::filter(Time==0) %>%
select(Subject.F,Personality, Vision)
MergedData<-left_join(PolyRegress,PersonalityScore, by = c("Vision","Subject.F"))
| Subject.F | Vision | r.squared | Personality |
|---|---|---|---|
| 1 | NoSee | 0.2266071 | 0.0222995 |
| 2 | NoSee | 0.9536065 | 0.6669353 |
| 3 | NoSee | 0.9938067 | 0.7457592 |
| 4 | NoSee | 0.5280375 | 0.8527015 |
| 5 | NoSee | 0.9600112 | 0.6066250 |
| 6 | NoSee | 0.9997217 | 2.1633908 |
- Scatter plot by Vision with fancy labels
Scatter.plot<-ggplot(data = MergedData, aes(Personality,r.squared))+
geom_point(aes(shape=Vision))+
xlab("Personality Score")+
ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
theme_minimal()
Scatter.plot
- Reorder and rename labels
MergedData$Vision.O <- factor(MergedData$Vision,
levels = c("See","NoSee"),
labels = c("Partner is visible","Partner is obscured"))
- Add new labels and fix up legend
Scatter.plot2<-ggplot(data = MergedData, aes(Personality,r.squared))+
geom_point(aes(shape=Vision.O),size = 2.5, stroke = 1.25)+
xlab("Personality Score")+
ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
theme_minimal()+theme(legend.position = "top",
legend.text = element_text(size = 11, color = "gray50"),
legend.title=element_blank())
Scatter.plot2
Scatter.plot3<-Scatter.plot2+scale_shape_manual(values=c(21,24))
Scatter.plot3
- Let’s add labels so we know which dots are which subjects
library("ggrepel")
Scatter.plot3 + scale_shape_manual(values=c(21,24))+
geom_text_repel(aes(label=Subject.F), size = 3)
- Label only a handful of subjects
pointsToLabel <- c("1","6","12")
Scatter.plot4<-Scatter.plot3 + scale_shape_manual(values=c(21,24))+
geom_text_repel(aes(label = Subject.F),
color = "gray20",
data = subset(MergedData, Subject.F %in% pointsToLabel),
force = 10)
Scatter.plot4
Scatter.plot5<- Scatter.plot4+scale_x_continuous(name = "Personality Score, Judging (1=Most)",
limits = c(0.0, 2.5),
breaks = seq(0.0, 2.5, by = 0.25))
Scatter.plot5
- Add a more complex regression line to the plot
Scatter.plot6 <-Scatter.plot5 +
geom_smooth(aes(linetype = Vision.O, group=Vision.O),
method = "lm",
formula = y ~ log(x), se = FALSE,
color = "green")
Scatter.plot6
---
title: 'Tidyverse Overview'
output:
  html_document:
    code_download: yes
    fontsize: 8pt
    highlight: textmate
    number_sections: no
    theme: flatly
    toc: yes
    toc_float:
      collapsed: no
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning =  FALSE)
knitr::opts_chunk$set(fig.width=5.25)
knitr::opts_chunk$set(fig.height=4.0)
knitr::opts_chunk$set(fig.align='center') 
knitr::opts_chunk$set(results='hold') 
```


# Tidyverse 
- Family of packages developed to streamline graphing, data manipulation, data wrangling, and programming
- Main packages of interest for us today: ggplot2, dplyr, tidyr (these have evolved to the point that they have replaced packages we covered in the R class: reshape2 and plyr)
- These packages can work on a modern type of data frame called a tibble 
- ggplot2 is for graphing
- dplyr is for data manipulation (think excel pivot tables on steroids)
- tidyr is for data wrangling (long to wide format and back again)
- These packages can use piping (the `%>%` operator, which dplyr re-exports from magrittr), which makes it easier to string together commands


# Wrangling 
- tidyr makes data tidy
- 2 workhorse functions
    - gather: Makes wide data long
    - spread: Makes long data wide
- 2 supplemental functions
    - separate/unite: breaks/merges columns into multiple/one column
    - extract: takes groups within a column to make new columns 

## Example Dataset
- Longitudinal data of trust over 5 time points (game plays) with the within-subject variable of vision
- Do you trust your partner in the game of "trouble" when you can or cannot see them 
- Covariate (how "judging" personality type they are 1-3 scale)
- Data is in typical wide SPSS-like format
- [Download Data](www.alexanderdemos.org/Mixed/TrustWideData.csv)

```{r}
TrustWide<-read.csv("Mixed/TrustWideData.csv")
```

```{r, echo=FALSE}
library(knitr); library(kableExtra) 
kable(head(TrustWide), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```


- This is not so useful for taking means and doing stats, so we will convert it to long format

## Convert to Long
- Process in words: 
    - We will use piping %>% to help us **pass** the dataset along to each function (so you ignore when the function asks for dataframe first)
    - gather(new variable name, new value name, variables to merge)

```{r}
library(tidyr)
library(dplyr)
TrustLong<-TrustWide %>% gather(Condition, TrustFeeling,NoSee_0:See_4)
str(TrustLong)
```


```{r, results='asis',echo=FALSE}
kable(head(TrustLong), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```


- Personality applies to the subject, so every time we see the subject names, we need their Personality score
- Our two variables (Vision & Time) are all stuck together. Good for SPSS RM anova, but not helpful for us.
    - separate(variable to split, c("what do we call them"), what separates them, convert them back to numeric/integers if they are numbers) 

```{r}
TrustLong.Final<-TrustLong %>% separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
```

```{r, results='asis',echo=FALSE}
kable(head(TrustLong.Final), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- You can do this all at once!

```{r}
TrustLong.Final<-
  TrustWide %>% 
  gather(Condition, TrustFeeling,NoSee_0:See_4) %>% 
  separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
```
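
- Side note: since tidyr 1.0.0, `gather()` and `separate()` still work but are superseded by `pivot_longer()`, which can reshape and split the column names in a single call. Below is a minimal sketch, not the lecture's method. It assumes tidyr 1.1.0 or later (for `names_transform`) and uses a small made-up data frame, `toy_wide`, rather than the real trust data.

```r
library(tidyr)

# Hypothetical stand-in for TrustWide: 2 subjects, 2 time points per vision condition
toy_wide <- data.frame(
  Subject     = 1:2,
  NoSee_0     = c(0.13, 0.21),
  NoSee_1     = c(0.15, 0.20),
  See_0       = c(0.30, 0.46),
  See_1       = c(0.44, 0.59),
  Personality = c(0.02, 0.67)
)

# Reshape to long and split names like "NoSee_0" into Vision and Time in one step
toy_long <- pivot_longer(
  toy_wide,
  cols            = NoSee_0:See_1,
  names_to        = c("Vision", "Time"),
  names_sep       = "_",
  names_transform = list(Time = as.integer),  # plays the role of convert = TRUE
  values_to       = "TrustFeeling"
)

toy_long
```

- Note that `pivot_longer()` returns the rows ordered subject by subject, whereas `gather()` stacks one original column after another, so the row order differs even though the content is the same.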

### Convert back to wide
- You just have to reverse the process. First, unite the variables into one column and then spread them out

```{r}
TrustWide.Again<-
  TrustLong.Final %>% 
  unite(Condition,c("Vision","Time"),sep="_") %>% 
  spread(Condition, TrustFeeling)
```
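
- Side note: the newer `pivot_wider()` (tidyr 1.0.0 and later) can build the combined column names itself, so the `unite()` step is not needed. A minimal sketch on made-up data (`toy_long` is hypothetical, not the real trust data):

```r
library(tidyr)

# Hypothetical long data: 2 subjects x 2 vision conditions x 2 time points
toy_long <- data.frame(
  Subject      = rep(1:2, each = 4),
  Personality  = rep(c(0.02, 0.67), each = 4),
  Vision       = rep(c("NoSee", "NoSee", "See", "See"), times = 2),
  Time         = rep(c(0L, 1L), times = 4),
  TrustFeeling = c(0.13, 0.15, 0.30, 0.44, 0.21, 0.20, 0.46, 0.59)
)

# names_from takes both columns and glues their values together with names_sep
toy_wide <- pivot_wider(
  toy_long,
  names_from  = c(Vision, Time),
  values_from = TrustFeeling,
  names_sep   = "_"
)

toy_wide
```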

```{r, results='asis',echo=FALSE}
kable(head(TrustWide.Again), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

# Data Manipulation 
- dplyr lets you calculate means, sd or any other statistic you may want on the data based on how the data was wrangled.
- 5 useful functions
  - group_by (how to cut data)
  - filter (subset on the fly)
  - summarise (what descriptive to conduct)
  - do (do more complex stats) [with the help of broom package which converts regressions to tibbles]
  - mutate (add stats back to data frame on the fly)
  
## Group and summarise
- We can calculate means and sd per group
```{r}
Means<-TrustLong.Final %>%
  group_by(Vision,Time) %>%
  summarise(MeanTrust=mean(TrustFeeling),
            SDTrust=sd(TrustFeeling))
```
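
- To see exactly what `group_by()` plus `summarise()` computes, here is a minimal sketch on a tiny made-up data frame (`toy_long` is hypothetical) where the group means are easy to check by hand. It assumes dplyr 1.0.0 or later for the `.groups` argument, and `n()` is added here to count rows per group.

```r
library(dplyr)

# Hypothetical data: two observations in each Vision x Time cell
toy_long <- data.frame(
  Vision       = rep(c("NoSee", "See"), each = 4),
  Time         = rep(c(0L, 0L, 1L, 1L), times = 2),
  TrustFeeling = c(0.10, 0.20, 0.20, 0.40, 0.40, 0.60, 0.50, 0.70)
)

toy_means <- toy_long %>%
  group_by(Vision, Time) %>%
  summarise(N         = n(),                  # rows per cell
            MeanTrust = mean(TrustFeeling),
            SDTrust   = sd(TrustFeeling),
            .groups   = "drop")               # return an ungrouped tibble

toy_means
```

- Each output row is one Vision by Time cell; for example, the NoSee/Time 0 mean is the average of 0.10 and 0.20.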

```{r, results='asis',echo=FALSE}
kable(Means, "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- we can also calculate a correlation per subject separately per vision condition

```{r}
library(broom)
CorrResult<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(cor.test(.$Time, .$TrustFeeling)))
```

```{r, results='asis',echo=FALSE}
kable(head(CorrResult), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- or we can also run regression per subject separately per vision condition 

```{r}
RegresResult<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(lm(TrustFeeling ~ Time, data=.)))

```
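
- Side note: `do()` is marked as superseded in recent dplyr releases. One alternative (an assumption about what you might prefer, not the approach used in these notes) is `group_modify()`, available in dplyr 0.8.1 and later, which applies a function to each group and binds the results. A minimal sketch on made-up data (`toy_long` is hypothetical):

```r
library(dplyr)
library(broom)

# Hypothetical data: 2 subjects x 2 vision conditions x 3 time points
toy_long <- data.frame(
  Vision       = rep(c("NoSee", "See"), each = 6),
  Subject      = rep(rep(1:2, each = 3), times = 2),
  Time         = rep(0:2, times = 4),
  TrustFeeling = c(0.10, 0.12, 0.17,  0.20, 0.19, 0.25,
                   0.30, 0.45, 0.52,  0.40, 0.44, 0.61)
)

# .x is the data for one Vision x Subject group (without the grouping columns)
toy_fits <- toy_long %>%
  group_by(Vision, Subject) %>%
  group_modify(~ tidy(lm(TrustFeeling ~ Time, data = .x))) %>%
  ungroup()

toy_fits
```

- The result has the same shape as the `do(tidy(...))` output: one row per term (intercept and slope) for each Vision by Subject group.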

```{r, results='asis',echo=FALSE}
kable(head(RegresResult), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- You mutate the dataset you just created and add a column created by another column

```{r}
RegresResult <-
  RegresResult %>%
  mutate(Sig = if_else(p.value < .05,1,0))
```

```{r, results='asis',echo=FALSE}
kable(head(RegresResult), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- Now you can take that new dataset and summarize it (mean slope estimates just for Time by significant vs nonsignificant results) [*Note: dplyr::filter is used because filter conflicts with a function of the same name in another package. Writing dplyr::filter forces the function to come from the dplyr package.*]

```{r}
RSum<-RegresResult  %>%
  group_by(Vision,Sig) %>%
  dplyr::filter(term=="Time") %>%
  summarise(N=length(Sig),
            MeanSlope=mean(estimate))
```

```{r, results='asis',echo=FALSE}
kable(RSum, "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- Note: You can do this all at once because each step is conducted in the order you call it

```{r}
RegresFinal<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(lm(TrustFeeling ~ Time, data=.))) %>%
  mutate(Sig = if_else(p.value < .05,1,0)) %>%
  group_by(Vision,Sig) %>%
  dplyr::filter(term=="Time") %>%
  summarise(N=length(Sig),
            MeanSlope=mean(estimate))
```

- You can pipe all these results into a ggplot

# ggplot
Inspired by "The Grammar of Graphics" by Leland Wilkinson (1999)

> "Destined to become a landmark in statistical graphics, this book provides a formal description of graphics, particularly static graphics, playing much the same role for graphics as probability theory played for statistics." (Journal of the American Statistical Association)

Leland Wilkinson: former VP at SPSS Inc., founder of SYSTAT, and Adjunct Professor of Statistics at Northwestern University. He is also affiliated with the Computer Science department at The University of Illinois at Chicago.

- Manual: http://ggplot2.tidyverse.org/reference/

## Grammar 
- Data (*data = *): **Data.Frame** to be mapped 
- Aesthetic mapping (*aes( )*): *x = *, *y = *, *group =*  from your data. Can also add additional mapping into the aes:
    - (*color = *, *shape = *, *size = *, *fill = *, *alpha =*, etc): These can be fixed values or variables
        - Note: if you place these mappings into aes(x=IV, y=DV, color=subjectID), then the subject will appear in your legend with those colors. You can move some of these mappings into the geom call to avoid that
- Geometric object (*geom_*): *bar*, *point*, *line*, *ribbons*, shapes you want to graph based on your *aes* 
    - Position adjustments (*position = *): goes with *geom_* call such as *position_dodge* (don't overlap), *position_identity* (leave as read in), *position_jitter*  (jitters data points in scatterplot)
- Statistical transformations (*stat_*): On the fly transforms (such as averaging): can be used instead of geoms (*Note*: I prefer to calculate stats outside of the plot when possible as it's easier to see what you are doing)
- Coordinate system (*coord_*): do you want to add *coord_cartesian()*, *coord_polar()*, etc
- Scales (*scale_*) or simply (xlim, ylim).  Override defaults to control many aspects of the graphs
- Faceting (*facet_*): visual subsets: two options *grid* or *wrap*
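
- The difference between mapping a variable inside `aes()` and setting a fixed value in the geom can be sketched as follows. This is a minimal illustration on made-up data; `toy`, `IV`, `DV`, and `subjectID` are hypothetical names taken from the note above, not columns of the trust data.

```r
library(ggplot2)

# Hypothetical data: 2 subjects measured at 3 levels of IV
toy <- data.frame(
  IV        = rep(1:3, times = 2),
  DV        = c(0.1, 0.3, 0.4, 0.2, 0.5, 0.9),
  subjectID = factor(rep(1:2, each = 3))
)

# Mapped: color is tied to a variable, so ggplot builds a legend for it
p_mapped <- ggplot(toy, aes(x = IV, y = DV, color = subjectID)) +
  geom_line()

# Fixed: one line per subject via group, but a single constant color and no legend
p_fixed <- ggplot(toy, aes(x = IV, y = DV, group = subjectID)) +
  geom_line(color = "red")

p_mapped
p_fixed
```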

### Layering
- First, you put on your jacket; then you put on your shoes, next underwear, finally you shower, right?
- ggplot has very specific order that you should generally follow. 
    - what is my data, what are my mappings, what are geoms, how should I position them, and last what do I want to do with the look 
- The order in which you add calls is the order in which they are drawn, so later layers sit on top of earlier ones, and later settings (labels, themes, scales) override earlier ones

## Walkthrough

- Step 1: ggplot and aes
```{r}
library(ggplot2)
G1<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))
G1
```

- Step 2: add geom
- This can help you look up all the types

```{r, eval=FALSE}
help.search("geom_", package = "ggplot2")
```

- let's start with points

```{r}
G2<-G1+geom_point()
G2
```

- Step 3: Best fit line

```{r}
G3<-G2+geom_smooth()
G3
```

- This line is loess, let's make it lm with a second-order polynomial: $y = x+x^2$, which we can write as `poly(x,2)` [orthogonal polynomials]. [*Note: you will not always need to call stats::poly; normally you can write just poly, but here it conflicts with a function of the same name in another package*]

```{r, eval=FALSE}
?geom_smooth
```

```{r poly1}
G3a<-G2+geom_smooth(method='lm', formula = y ~ stats::poly(x,2))
G3a
```
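
- A quick check of what "orthogonal polynomials" means in practice: `poly(x, 2)` and the raw powers $x$ and $x^2$ give different coefficients but the same fitted curve, so the plotted line does not depend on which one you use. A minimal sketch with made-up values (`x` and `y` here are hypothetical, not taken from the trust data):

```r
# Hypothetical data: an inverted-U pattern over 5 time points
x <- 0:4
y <- c(0.30, 0.44, 0.48, 0.42, 0.27)

fit_orth <- lm(y ~ stats::poly(x, 2))              # orthogonal polynomials (default)
fit_raw  <- lm(y ~ stats::poly(x, 2, raw = TRUE))  # raw powers: x and x^2

# Same fitted values ...
all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))

# ... but different coefficients, because the predictors are scaled differently
coef(fit_orth)
coef(fit_raw)
```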

- Step 4: facet grid by Vision condition

```{r}
G4a<-G3a+facet_grid(~Vision)
G4a
```


- Step 5: Fancy the plot up

- Add a theme
```{r}
G4b<-G4a+theme_bw()
G4b
```

- Modify the theme to change the minor grid lines

```{r}
G4c<-G4b+theme(panel.grid.minor = element_blank())
G4c
```

- Change the x and y labels
```{r}
G4d<-G4c+xlab("Time Step")+ylab("Trust Score")
G4d
```


- Override the geom_smooth with a different color line and set SE to false

```{r poly2}
G4e<-G4d+geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')
G4e
```

- Wait, why do I still see SE ribbons? Because the earlier geom_smooth layer is still there!
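
- Adding a second `geom_smooth()` does not replace the first; it stacks a new layer on top. A minimal sketch on made-up data (`toy` is hypothetical) that counts the layers stored in each plot object, assuming the `$layers` element is accessible as in current ggplot2 releases:

```r
library(ggplot2)

# Hypothetical data: two observations at each of 5 time points
toy <- data.frame(
  x = rep(0:4, times = 2),
  y = c(0.1, 0.2, 0.2, 0.3, 0.3, 0.3, 0.5, 0.6, 0.6, 0.5)
)

p1 <- ggplot(toy, aes(x, y)) + geom_point()                  # points only
p2 <- p1 + geom_smooth(method = "lm", formula = y ~ x)       # smooth with SE ribbon
p3 <- p2 + geom_smooth(method = "lm", formula = y ~ x,
                       se = FALSE, color = "red")            # a second smooth layer

# Each + geom_*() call appends one layer; nothing is removed
length(p1$layers)
length(p2$layers)
length(p3$layers)
```

- To get rid of the ribbon you have to rebuild the plot without the first smooth layer, rather than add a new one.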

- Let's make the graph all at once

```{r}
Poly.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())
Poly.Plot
```

### Spaghetti Plot
- We can add a best fit line per subject; we just need to make the subject a grouping variable
    - aes(group=Subject)
        - This can be added at geom_smooth if you want it to apply only to that geom, as below 

```{r speg1}
Speg.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(aes(group=Subject),method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())+
  ggtitle("Trust Study")
Speg.Plot
```

- or you can apply it in the over-arching aes 
```{r}
Speg.Plot.2<-ggplot(data = TrustLong.Final, 
                    aes(x = Time , y = TrustFeeling,group=Subject))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())+
  ggtitle("Trust Study")
Speg.Plot.2
```

- now that the grouping is in the over-arching aes, you can do other fun stuff, like color each data point and line by subject
    - Notice below where I added: aes(color=Subject) 
    
```{r spegcolor}
Speg.Plot.3<-ggplot(data = TrustLong.Final, 
                    aes(x = Time, y = TrustFeeling, group=Subject, color=Subject))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())
Speg.Plot.3
```


- Wait, why is the color continuous? 

```{r}
str(TrustLong.Final)
```

- Because Subject is an integer

```{r}
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
```
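
- ggplot2 picks a continuous color scale for numeric columns and a discrete one for factors (and characters), which is why the conversion matters. A minimal sketch of what `as.factor()` changes, using made-up subject IDs (`subject_int` and `subject_fac` are hypothetical names):

```r
# Hypothetical subject IDs stored as integers, as read.csv would give you
subject_int <- c(1L, 2L, 3L, 1L, 2L, 3L)
subject_fac <- as.factor(subject_int)

is.numeric(subject_int)   # numeric: ggplot2 would use a continuous gradient
is.numeric(subject_fac)   # not numeric: ggplot2 would use distinct colors
levels(subject_fac)       # one level per subject
```

- If you do not want to add a column to the data frame, writing `color = factor(Subject)` inside `aes()` should have the same effect on the plot, at the cost of a less tidy legend title.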

- Remake the plot and replace Subject with Subject.F
- also we can add Shape and linetype by Subject (in case we have to print in black and white)

```{r seg4}
Speg.Plot.4<-ggplot(data = TrustLong.Final, 
                    aes(x = Time, y = TrustFeeling, group=Subject.F, color=Subject.F,shape=Subject.F,linetype=Subject.F))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())
Speg.Plot.4
```

- Remove the legend

```{r, eval=FALSE}
Speg.Plot.4a<-Speg.Plot.4+theme(legend.position = "none")
Speg.Plot.4a
```

```{r, echo=FALSE}
Speg.Plot.4a<-ggplot(data = TrustLong.Final, 
                    aes(x = Time, y = TrustFeeling, group=Subject.F, color=Subject.F,shape=Subject.F,linetype=Subject.F))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())+theme(legend.position = "none")
Speg.Plot.4a
```


- And you can make it totally "far out"
    - You can specify colors as HTML hex codes or as color names

```{r}
Speg.Plot.4a + theme(plot.background = element_rect(size = 1, color = "blue", fill = "purple"),
        text=element_text(size = 12, family = "serif", color = "ivory"),
        axis.text.y = element_text(colour = "magenta"),
        axis.text.x = element_text(colour = "green"),
        panel.background = element_rect(fill = "pink"),
        strip.background = element_rect(fill = "#ccff66"))
```

# Tidyr & ggplot fun time
- Just for fun, let's run the polynomial regressions and extract the $R^2$ per subject for each Vision condition

```{r dolm}
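library(broom) # provides glance()
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
# Fit a quadratic model per Vision x Subject and keep the model-level R^2
PolyRegress<-TrustLong.Final %>%
  group_by(Vision, Subject.F) %>%
  do(glance(lm(TrustFeeling ~ stats::poly(Time,2), data=.)))  %>%
  select(Vision,Subject.F,r.squared) 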
```
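
- To see what the grouped fit is doing, here is a rough base R sketch on made-up numbers (`toy.trust` is hypothetical, not the trust data): split by subject, fit the same quadratic model to each piece, and keep only the $R^2$

```{r}
# Hypothetical toy data: two subjects, five time points each
toy.trust <- data.frame(
  Subject.F    = factor(rep(c("1", "2"), each = 5)),
  Time         = rep(0:4, times = 2),
  TrustFeeling = c(0.30, 0.44, 0.48, 0.42, 0.27,  # rises then falls
                   0.10, 0.22, 0.29, 0.41, 0.50)  # roughly linear
)

# One quadratic lm per subject; summary()$r.squared is the same
# quantity that glance() reports in its r.squared column
toy.r2 <- sapply(split(toy.trust, toy.trust$Subject.F), function(d) {
  summary(lm(TrustFeeling ~ stats::poly(Time, 2), data = d))$r.squared
})
toy.r2
```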


```{r, results='asis',echo=FALSE}
kable(head(PolyRegress), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- Let's extract the Personality score and join it to that new dataset  

```{r}
PersonalityScore<-TrustLong.Final %>% 
  dplyr::filter(Time==0) %>% 
  select(Subject.F,Personality, Vision)

MergedData<-left_join(PolyRegress,PersonalityScore, by=c("Vision","Subject.F"))
```
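
- After a join it is worth checking that no rows were duplicated or left unmatched. A small sketch on made-up tables (`toy.left` and `toy.right` are hypothetical stand-ins, not the study data)

```{r}
library(dplyr)

# Hypothetical toy tables keyed by Vision and Subject.F
toy.left <- data.frame(Vision    = c("See", "NoSee", "See", "NoSee"),
                       Subject.F = factor(c("1", "1", "2", "2")),
                       r.squared = c(0.91, 0.40, 0.85, 0.73))
toy.right <- data.frame(Subject.F   = factor(c("1", "1", "2", "2")),
                        Vision      = c("See", "NoSee", "See", "NoSee"),
                        Personality = c(0.5, 0.5, 1.2, 1.2))

toy.merged <- left_join(toy.left, toy.right, by = c("Vision", "Subject.F"))

# Row count is unchanged when each key pair appears once in the right table
nrow(toy.merged) == nrow(toy.left)
# No NA means every row on the left found a match
anyNA(toy.merged$Personality)
```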

```{r, results='asis',echo=FALSE}
kable(head(MergedData), "html", booktabs = T) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

- Scatter plot by Vision with fancy labels

```{r}
Scatter.plot<-ggplot(data = MergedData, aes(Personality,r.squared))+
  geom_point(aes(shape=Vision))+
  xlab("Personality Score")+
  ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
  theme_minimal()
Scatter.plot
```

- Reorder and rename labels

```{r}
MergedData$Vision.O <- factor(MergedData$Vision,
                     levels = c("See","NoSee"),
                     labels = c("Partner is Visible","Partner is Obscured"))
```
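
- How `levels` and `labels` work together, sketched on a made-up vector (`toy.vision` is hypothetical): `levels` sets the order, and `labels` renames those levels by position

```{r}
# Hypothetical toy vector, not the study data
toy.vision <- c("NoSee", "See", "See", "NoSee")

toy.vision.f <- factor(toy.vision,
                       levels = c("See", "NoSee"),
                       labels = c("Partner is Visible", "Partner is Obscured"))

levels(toy.vision.f)  # order follows levels=, names follow labels=
table(toy.vision.f)   # counts per relabeled level
```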

- Add new labels and fix up legend

```{r}
Scatter.plot2<-ggplot(data = MergedData, aes(Personality,r.squared))+
  geom_point(aes(shape=Vision.O),size = 2.5, stroke = 1.25)+
  xlab("Personality Score")+
  ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
  theme_minimal()+theme(legend.position = "top",
                   legend.text = element_text(size = 11, color = "gray50"),
                   legend.title=element_blank())
Scatter.plot2
```

- Let's change the shape
    - http://sape.inf.usi.ch/quick-reference/ggplot2/shape
    
```{r}
Scatter.plot3<-Scatter.plot2+scale_shape_manual(values=c(21,24))
Scatter.plot3
```

- Let's add labels so we know which dots are which subjects

```{r}
library("ggrepel")
# Scatter.plot3 already carries the manual shape scale, so only the labels are added
Scatter.plot3 +
  geom_text_repel(aes(label=Subject.F), size = 3)
```

- Label only a handful of subjects

```{r}
pointsToLabel <- c("1","6","12")
Scatter.plot4<-Scatter.plot3 +
    geom_text_repel(aes(label = Subject.F),
                    color = "gray20",
                    data = subset(MergedData, Subject.F %in% pointsToLabel),
                    force = 10)
Scatter.plot4
```
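
- The `data = subset(...)` argument is what limits the labels. A quick sketch of that filter on made-up values (`toy.points` and `toy.ids` are hypothetical names)

```{r}
# Hypothetical toy data, not the study data
toy.points <- data.frame(Subject.F = factor(1:5),
                         r.squared = c(0.9, 0.5, 0.7, 0.8, 0.6))
toy.ids <- c("1", "4")

# %in% compares the factor's labels to the character IDs,
# so only the matching rows are passed on to the label layer
toy.labeled <- subset(toy.points, Subject.F %in% toy.ids)
toy.labeled
```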

- Fix up the x-axis

```{r}
Scatter.plot5<- Scatter.plot4+scale_x_continuous(name = "Personality Score, Judging (1=Most)",
                       limits = c(0.0, 2.5),
                       breaks = seq(0.0, 2.5, by = 0.25)) 
Scatter.plot5
```

- Add a more complex regression line to the plot (here, y regressed on log(x))

```{r}
Scatter.plot6 <-Scatter.plot5 +  
  geom_smooth(aes(linetype = Vision.O, group=Vision.O),
              method = "lm",
              formula = y ~ log(x), se = FALSE,
              color = "green")
Scatter.plot6
```
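
- The smoother above draws the line from `lm(y ~ log(x))` fit within each group. A minimal sketch of that model on made-up numbers (`toy.fit.data` is hypothetical), assuming all x values are positive since `log()` is undefined at zero or below

```{r}
# Hypothetical toy data, not the study results
toy.fit.data <- data.frame(Personality = c(0.1, 0.5, 1.0, 1.5, 2.0),
                           r.squared   = c(0.35, 0.60, 0.72, 0.78, 0.83))

# Same formula as in geom_smooth(), written with the column names
toy.fit <- lm(r.squared ~ log(Personality), data = toy.fit.data)
coef(toy.fit)  # intercept and slope on the log scale
```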
