Wrangling
- tidyr makes data tidy
- 2 workhorse functions
- gather: Makes wide data long
- spread: Makes long data wide
- 2 supplemental functions
- separate/unite: splits one column into several / merges several
columns into one
- extract: uses regular-expression capture groups within a column to
make new columns
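As a minimal sketch of the supplemental functions, the toy data frame below (hypothetical, not the trust dataset) splits a combined column with separate() and with extract(), then merges it back with unite(). The column names and the regular expression are assumptions chosen for illustration.

```r
library(tidyr)

# Hypothetical toy data: one column that packs two variables together
toy <- data.frame(id = 1:3,
                  Condition = c("NoSee_0", "See_1", "See_4"),
                  stringsAsFactors = FALSE)

# separate(): split the column on a delimiter
toy_sep <- separate(toy, Condition, c("Vision", "Time"),
                    sep = "_", convert = TRUE)

# extract(): pull out regular-expression capture groups instead
toy_ext <- extract(toy, Condition, c("Vision", "Time"),
                   regex = "([A-Za-z]+)_([0-9]+)", convert = TRUE)

# unite(): merge the pieces back into a single column
toy_uni <- unite(toy_sep, Condition, Vision, Time, sep = "_")
```

separate() is enough when a clean delimiter exists; extract() is the fallback when the pieces have to be described by a pattern.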
Example Dataset
- Longitudinal data of trust over 5 time points (game plays) with the
within-subject variable of vision
- Do you trust your partner in the game of “Trouble” when you can or
cannot see them?
- Covariate (how “judging” a personality type they are, 1-3 scale)
- Data is in typical wide SPSS-like format
- Download Data
TrustWide<-read.csv("Mixed/TrustWideData.csv")
Subject | NoSee_0 | NoSee_1 | NoSee_2 | NoSee_3 | NoSee_4 | See_0 | See_1 | See_2 | See_3 | See_4 | Personality
1 | 0.1301 | 0.1504 | 0.1209 | 0.1216 | 0.1525 | 0.295 | 0.44 | 0.475 | 0.42 | 0.265 | 0.0222995
2 | 0.2051 | 0.2004 | 0.1959 | 0.2316 | 0.2575 | 0.460 | 0.59 | 0.660 | 0.66 | 0.530 | 0.6669353
3 | 0.0601 | 0.0604 | 0.1309 | 0.2316 | 0.3425 | 0.430 | 0.61 | 0.720 | 0.77 | 0.750 | 0.7457592
4 | 0.2101 | 0.2404 | 0.2209 | 0.2416 | 0.2425 | 0.435 | 0.57 | 0.625 | 0.59 | 0.425 | 0.8527015
5 | 0.1451 | 0.1304 | 0.1359 | 0.1716 | 0.1975 | 0.420 | 0.54 | 0.610 | 0.60 | 0.490 | 0.6066250
6 | 0.1101 | 0.1404 | 0.1909 | 0.2616 | 0.3625 | 0.440 | 0.57 | 0.640 | 0.67 | 0.650 | 2.1633908
- This is not so useful for taking means and doing stats, so we will
convert it to long format
Convert to Long
- Process in words:
- We will use piping (%>%) to pass the dataset along to each function
(so you can skip the data frame argument that each function asks for
first)
- gather(new variable name, new value name, variables to merge)
library(tidyr)
library(dplyr)
TrustLong<-TrustWide %>% gather(Condition, TrustFeeling,NoSee_0:See_4)
str(TrustLong)
## 'data.frame': 120 obs. of 4 variables:
## $ Subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Personality : num 0.0223 0.6669 0.7458 0.8527 0.6066 ...
## $ Condition : chr "NoSee_0" "NoSee_0" "NoSee_0" "NoSee_0" ...
## $ TrustFeeling: num 0.1301 0.2051 0.0601 0.2101 0.1451 ...
Subject | Personality | Condition | TrustFeeling
1 | 0.0222995 | NoSee_0 | 0.1301
2 | 0.6669353 | NoSee_0 | 0.2051
3 | 0.7457592 | NoSee_0 | 0.0601
4 | 0.8527015 | NoSee_0 | 0.2101
5 | 0.6066250 | NoSee_0 | 0.1451
6 | 2.1633908 | NoSee_0 | 0.1101
- Personality applies to the subject, so every row for a subject needs
to carry that subject’s Personality score
- Our two variables (Vision & Time) are stuck together in one column.
Good for SPSS RM ANOVA, but not helpful for us.
- separate(variable to split, c(“what do we call them”), what
separates them, convert them back to numeric/integer if they are
numbers)
TrustLong.Final<-TrustLong %>% separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
Subject | Personality | Vision | Time | TrustFeeling
1 | 0.0222995 | NoSee | 0 | 0.1301
2 | 0.6669353 | NoSee | 0 | 0.2051
3 | 0.7457592 | NoSee | 0 | 0.0601
4 | 0.8527015 | NoSee | 0 | 0.2101
5 | 0.6066250 | NoSee | 0 | 0.1451
6 | 2.1633908 | NoSee | 0 | 0.1101
- You can do this all at once!
TrustLong.Final<-
  TrustWide %>%
  gather(Condition, TrustFeeling,NoSee_0:See_4) %>%
  separate(Condition,c("Vision","Time"),sep="_", convert = TRUE)
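gather() and spread() still work, but the tidyr documentation marks them as superseded by pivot_longer() and pivot_wider() (tidyr 1.0.0 and later). As a hedged sketch, the same reshape-and-split step could be written in one call as below. The two-subject data frame is a hypothetical stand-in for TrustWide, and names_transform assumes tidyr 1.1.0 or later.

```r
library(tidyr)

# Hypothetical two-subject stand-in for TrustWide
toy_wide <- data.frame(Subject = 1:2,
                       NoSee_0 = c(0.13, 0.21), NoSee_1 = c(0.15, 0.20),
                       See_0   = c(0.30, 0.46), See_1   = c(0.44, 0.59),
                       Personality = c(0.02, 0.67))

# Reshape to long and split "Vision_Time" column names in a single step
toy_long <- pivot_longer(toy_wide,
                         cols = NoSee_0:See_1,
                         names_to = c("Vision", "Time"),
                         names_sep = "_",
                         names_transform = list(Time = as.integer),
                         values_to = "TrustFeeling")
toy_long
```

The result is a tibble rather than a plain data frame, with one row per subject, vision condition, and time point.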
Convert back to wide
- You just have to reverse the process. First, unite the variables
into one column and then spread them out
TrustWide.Again<-
  TrustLong.Final %>%
  unite(Condition,c("Vision","Time"),sep="_") %>%
  spread(Condition, TrustFeeling)
Subject | Personality | NoSee_0 | NoSee_1 | NoSee_2 | NoSee_3 | NoSee_4 | See_0 | See_1 | See_2 | See_3 | See_4
1 | 0.0222995 | 0.1301 | 0.1504 | 0.1209 | 0.1216 | 0.1525 | 0.295 | 0.44 | 0.475 | 0.42 | 0.265
2 | 0.6669353 | 0.2051 | 0.2004 | 0.1959 | 0.2316 | 0.2575 | 0.460 | 0.59 | 0.660 | 0.66 | 0.530
3 | 0.7457592 | 0.0601 | 0.0604 | 0.1309 | 0.2316 | 0.3425 | 0.430 | 0.61 | 0.720 | 0.77 | 0.750
4 | 0.8527015 | 0.2101 | 0.2404 | 0.2209 | 0.2416 | 0.2425 | 0.435 | 0.57 | 0.625 | 0.59 | 0.425
5 | 0.6066250 | 0.1451 | 0.1304 | 0.1359 | 0.1716 | 0.1975 | 0.420 | 0.54 | 0.610 | 0.60 | 0.490
6 | 2.1633908 | 0.1101 | 0.1404 | 0.1909 | 0.2616 | 0.3625 | 0.440 | 0.57 | 0.640 | 0.67 | 0.650
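One way to confirm that the round trip loses nothing is to compare the rebuilt wide table with the original. The sketch below does this on a hypothetical two-subject data frame (not the trust dataset). Only the column order differs, because Personality moves next to Subject, so the columns are reordered before comparing.

```r
library(tidyr)
library(dplyr)

# Hypothetical two-subject wide data
toy_wide <- data.frame(Subject = 1:2,
                       NoSee_0 = c(0.13, 0.21), NoSee_1 = c(0.15, 0.20),
                       See_0   = c(0.30, 0.46), See_1   = c(0.44, 0.59),
                       Personality = c(0.02, 0.67))

# Wide -> long, splitting Condition into Vision and Time
toy_long <- toy_wide %>%
  gather(Condition, TrustFeeling, NoSee_0:See_1) %>%
  separate(Condition, c("Vision", "Time"), sep = "_", convert = TRUE)

# Long -> wide again
toy_back <- toy_long %>%
  unite(Condition, c("Vision", "Time"), sep = "_") %>%
  spread(Condition, TrustFeeling)

# Reorder columns to match the original, then compare values
same <- all.equal(toy_back[, names(toy_wide)], toy_wide,
                  check.attributes = FALSE)
same
```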
Data Manipulation
- dplyr lets you calculate means, SDs, or any other statistic you may
want on the data, based on how the data was wrangled.
- 5 useful functions
- group_by (how to cut the data)
- filter (subset on the fly)
- summarise (which descriptives to compute)
- do (run more complex stats) [with the help of the broom package, which
converts model output such as regressions to tibbles]
- mutate (add new columns to the data frame on the fly)
Group and summarise
- We can calculate means and sd per group
Means<-TrustLong.Final %>%
  group_by(Vision,Time) %>%
  summarise(MeanTrust=mean(TrustFeeling),
            SDTrust=sd(TrustFeeling))
Vision | Time | MeanTrust | SDTrust
NoSee | 0 | 0.1451000 | 0.0596200
NoSee | 1 | 0.1512333 | 0.0751312
NoSee | 2 | 0.1717333 | 0.0662582
NoSee | 3 | 0.2091000 | 0.0988226
NoSee | 4 | 0.2591667 | 0.1212685
See | 0 | 0.4141667 | 0.0504450
See | 1 | 0.5525000 | 0.0525919
See | 2 | 0.6191667 | 0.0801088
See | 3 | 0.6183333 | 0.1228327
See | 4 | 0.5150000 | 0.1749026
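The same group-and-summarise pattern extends to other descriptives. As a small sketch on hypothetical toy data, the block below also counts rows with n() and derives a standard error from columns created earlier in the same summarise() call. The .groups = "drop" argument assumes dplyr 1.0.0 or later, and the SETrust name is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: 3 subjects in each vision condition
toy_long <- data.frame(Subject = rep(1:3, times = 2),
                       Vision = rep(c("NoSee", "See"), each = 3),
                       TrustFeeling = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60))

toy_means <- toy_long %>%
  group_by(Vision) %>%
  summarise(N = n(),                           # rows per group
            MeanTrust = mean(TrustFeeling),
            SDTrust = sd(TrustFeeling),
            SETrust = SDTrust / sqrt(N),       # reuses columns defined above
            .groups = "drop")                  # return an ungrouped result
toy_means
```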
- We can also calculate a correlation per subject, separately per
vision condition
library(broom)
CorrResult<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(cor.test(.$Time, .$TrustFeeling)))
Vision | Subject | estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative
NoSee | 1 | 0.1645258 | 0.2889041 | 0.7914681 | 3 | -0.8396155 | 0.9141048 | Pearson’s product-moment correlation | two.sided
NoSee | 2 | 0.8261808 | 2.5398903 | 0.0846876 | 3 | -0.2068905 | 0.9881634 | Pearson’s product-moment correlation | two.sided
NoSee | 3 | 0.9577867 | 5.7706177 | 0.0103452 | 3 | 0.4873004 | 0.9973063 | Pearson’s product-moment correlation | two.sided
NoSee | 4 | 0.7068877 | 1.7309782 | 0.1818874 | 3 | -0.4660154 | 0.9787461 | Pearson’s product-moment correlation | two.sided
NoSee | 5 | 0.8234354 | 2.5135828 | 0.0866641 | 3 | -0.2150959 | 0.9879596 | Pearson’s product-moment correlation | two.sided
NoSee | 6 | 0.9769363 | 7.9243892 | 0.0041901 | 3 | 0.6856064 | 0.9985416 | Pearson’s product-moment correlation | two.sided
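The dplyr documentation marks do() as superseded, so here is a hedged sketch of one alternative, group_modify(), which applies a function to each group and binds the resulting data frames back together. The simulated toy data below is hypothetical and only stands in for TrustLong.Final.

```r
library(dplyr)
library(broom)

# Hypothetical simulated data: 2 subjects x 2 vision conditions x 5 time points
set.seed(1)
toy_long <- expand.grid(Subject = 1:2,
                        Vision = c("NoSee", "See"),
                        Time = 0:4,
                        stringsAsFactors = FALSE)
toy_long$TrustFeeling <- 0.2 + 0.05 * toy_long$Time +
  rnorm(nrow(toy_long), sd = 0.05)

# One correlation test per Vision x Subject group; .x is the group's data
toy_corr <- toy_long %>%
  group_by(Vision, Subject) %>%
  group_modify(~ tidy(cor.test(.x$Time, .x$TrustFeeling))) %>%
  ungroup()
toy_corr
```

The grouping columns are added back automatically, so the output has the same shape as the do() version: one row per group with the tidy() columns.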
- Or we can run a regression per subject, separately per vision
condition
RegresResult<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(lm(TrustFeeling ~ Time, data=.)))
Vision | Subject | term | estimate | std.error | statistic | p.value
NoSee | 1 | (Intercept) | 0.1319 | 0.0135657 | 9.7230572 | 0.0023108
NoSee | 1 | Time | 0.0016 | 0.0055382 | 0.2889041 | 0.7914681
NoSee | 2 | (Intercept) | 0.1909 | 0.0131159 | 14.5548039 | 0.0007033
NoSee | 2 | Time | 0.0136 | 0.0053546 | 2.5398903 | 0.0846876
NoSee | 3 | (Intercept) | 0.0179 | 0.0312414 | 0.5729568 | 0.6068013
NoSee | 3 | Time | 0.0736 | 0.0127543 | 5.7706177 | 0.0103452
- You can mutate the dataset you just created to add a new column
computed from an existing column (here, a flag for p.value < .05)
RegresResult <-
  RegresResult %>%
  mutate(Sig = if_else(p.value < .05,1,0))
Vision | Subject | term | estimate | std.error | statistic | p.value | Sig
NoSee | 1 | (Intercept) | 0.1319 | 0.0135657 | 9.7230572 | 0.0023108 | 1
NoSee | 1 | Time | 0.0016 | 0.0055382 | 0.2889041 | 0.7914681 | 0
NoSee | 2 | (Intercept) | 0.1909 | 0.0131159 | 14.5548039 | 0.0007033 | 1
NoSee | 2 | Time | 0.0136 | 0.0053546 | 2.5398903 | 0.0846876 | 0
NoSee | 3 | (Intercept) | 0.0179 | 0.0312414 | 0.5729568 | 0.6068013 | 0
NoSee | 3 | Time | 0.0736 | 0.0127543 | 5.7706177 | 0.0103452 | 1
- Now you can take that new dataset and summarize it (mean slope
estimates just for Time, by significant vs. nonsignificant results)
[Note: dplyr::filter is used because filter conflicts with a function in
another package. This forces the filter function to come from the dplyr
package.]
RSum<-RegresResult %>%
  group_by(Vision,Sig) %>%
  dplyr::filter(term=="Time") %>%
  summarise(N=length(Sig),
            MeanSlope=mean(estimate))
Vision | Sig | N | MeanSlope
NoSee | 0 | 8 | 0.0099750
NoSee | 1 | 4 | 0.0658500
See | 0 | 9 | 0.0071111
See | 1 | 3 | 0.0856667
- Note: You can do this all at once because each step is conducted in
the order you call it
RegresFinal<-TrustLong.Final %>%
  group_by(Vision,Subject) %>%
  do(tidy(lm(TrustFeeling ~ Time, data=.))) %>%
  mutate(Sig = if_else(p.value < .05,1,0)) %>%
  group_by(Vision,Sig) %>%
  dplyr::filter(term=="Time") %>%
  summarise(N=length(Sig),
            MeanSlope=mean(estimate))
- You can pipe all these results into a ggplot
ggplot
Inspired by “The Grammar of Graphics” by Leland Wilkinson (1999).
“Destined to become a landmark in statistical graphics, this book
provides a formal description of graphics, particularly static graphics,
playing much the same role for graphics as probability theory played for
statistics.” (Journal of the American Statistical Association)
Leland Wilkinson: Former VP at SPSS Inc. Founder of SYSTAT. Adjunct
Professor of Statistics at Northwestern University. He is also
affiliated with the Computer Science department at The University of
Illinois at Chicago.
Grammar
- Data (data = ): the data frame to be mapped
- Aesthetic mapping (aes( )): x = , y = , group = from your data. You
can also add additional mappings into the aes:
- (color = , shape = , size = , fill = , alpha = , etc.): these can be
fixed values or variables
- Note: if you place these calls into aes(x=IV, y=DV, color=subjectID),
then the subjects will appear in your legend with those colors. You can
move some of these mappings into the geom call to avoid that
- Geometric object (geom_): bars, points, lines, ribbons; the shapes
you want to graph based on your aes
- Position adjustments (position = ): goes with the geom_ call, such as
position_dodge (don’t overlap), position_identity (leave as read in),
position_jitter (jitters data points in a scatterplot)
- Statistical transformations (stat_): on-the-fly transforms (such as
averaging); can be used instead of geoms (Note: I prefer to calculate
stats outside of the plot when possible, as it’s easier to see what you
are doing)
- Coordinate system (coord_): do you want to add coord_cartesian(),
coord_polar(), etc.
- Scales (scale_) or simply (xlim, ylim): override defaults to control
many aspects of the graph
- Faceting (facet_): visual subsets; two options, grid or wrap
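To make the grammar concrete, here is a minimal sketch that touches each component once. It uses R's built-in mtcars data rather than the trust dataset, and the particular geoms, jitter width, axis label, and limits are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +       # data + aesthetic mapping
  geom_point(aes(color = factor(am)),                    # geom with a mapped aesthetic
             position = position_jitter(width = 0.05,    # position adjustment
                                        height = 0)) +
  stat_smooth(method = "lm", formula = y ~ x,            # statistical transformation
              se = FALSE, color = "black") +             # fixed value, set outside aes()
  scale_x_continuous(name = "Weight (1000 lbs)") +       # scale
  coord_cartesian(ylim = c(10, 35)) +                    # coordinate system
  facet_wrap(~ cyl)                                      # faceting
p
```

Because color = factor(am) sits inside aes(), it produces a legend; the black smoothing line is a fixed value and does not.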
Layering
- First, you put on your jacket; then you put on your shoes, next
underwear, finally you shower, right?
- ggplot has a specific order that you should generally follow.
- What is my data, what are my mappings, what are my geoms, how should
I position them, and last, what do I want to do with the look
- The order in which you add calls is the order in which they are
applied, so later calls will override earlier calls
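A short sketch of why order matters, using the built-in mtcars data as a stand-in: theme_bw() is a complete theme, so adding it after a theme() tweak replaces that tweak, while adding it first lets the tweak survive. The object names lost and kept are made up for this example.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Tweak first, complete theme second: theme_bw() replaces the earlier tweak,
# so the minor grid lines come back
lost <- base + theme(panel.grid.minor = element_blank()) + theme_bw()

# Complete theme first, tweak second: the minor grid lines stay removed
kept <- base + theme_bw() + theme(panel.grid.minor = element_blank())

lost
kept
```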
Walkthrough
library(ggplot2)
G1<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))
G1
- Step 2: add geom
- This can help you look up all the types
help.search("geom_", package = "ggplot2")
G2<-G1+geom_point()
G2
G3<-G2+geom_smooth()
G3
- This line is loess; let’s make it lm with a second-order polynomial,
\(y = b_0 + b_1 x + b_2 x^2\), which we can write as poly(x, 2)
[orthogonal polynomials]. [Note: you will not always need to call
stats::poly; normally you can write just poly, but here it conflicts
with a function from another package]
?geom_smooth
G3a<-G2+geom_smooth(method='lm', formula = y ~ stats::poly(x,2))
G3a
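Orthogonal polynomials change the coefficients but not the fitted curve. The sketch below, on hypothetical simulated data, fits the same quadratic with poly(x, 2) and with raw powers via poly(x, 2, raw = TRUE) and compares the fitted values, which should agree up to numerical tolerance.

```r
# Hypothetical simulated data: 4 observations at each of 5 time steps
set.seed(42)
toy <- data.frame(x = rep(0:4, times = 4))
toy$y <- 0.2 + 0.15 * toy$x - 0.03 * toy$x^2 + rnorm(nrow(toy), sd = 0.02)

# Orthogonal polynomial terms (what poly() returns by default)
fit_orth <- lm(y ~ stats::poly(x, 2), data = toy)

# Raw powers x and x^2; these coefficients are on the original scale
fit_raw <- lm(y ~ stats::poly(x, 2, raw = TRUE), data = toy)

# Same curve, different parameterization
same_fit <- all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))
same_fit
coef(fit_raw)
```

For plotting, either form draws the same line; the raw form is mainly useful when you want to read the coefficients as b_0, b_1, and b_2.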
- Step 4: facet grid by Vision condition
G4a<-G3a+facet_grid(~Vision)
G4a
G4b<-G4a+theme_bw()
G4b
- Modify the theme to change the minor grid lines
G4c<-G4b+theme(panel.grid.minor = element_blank())
G4c
- Change the x and y labels
G4d<-G4c+xlab("Time Step")+ylab("Trust Score")
G4d
- Add a second geom_smooth on top with a different color line and se
set to FALSE (it is drawn over the earlier one)
G4e<-G4d+geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')
G4e
Poly.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
  facet_grid(~Vision)+
  geom_point()+
  geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
  xlab("Time Step")+ylab("Trust Score")+
  theme_bw()+
  theme(panel.grid.minor = element_blank())
Poly.Plot
Spaghetti Plot
- We can add a best-fit line per subject; we just need to make Subject
a grouping variable
- aes(group=Subject)
- This can be added inside geom_smooth if you want it to apply only to
that geom, as below
Speg.Plot<-ggplot(data = TrustLong.Final, aes(x = Time , y = TrustFeeling))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(aes(group=Subject),method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())+
ggtitle("Trust Study")
Speg.Plot
- Or you can apply it in the over-arching aes
Speg.Plot.2<-ggplot(data = TrustLong.Final,
aes(x = Time , y = TrustFeeling,group=Subject))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2), color='red')+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())+
ggtitle("Trust Study")
Speg.Plot.2
- Now that you have applied it to the over-arching aes, you can do
other fun stuff, like color each data point and line by subject
- Notice below where I added: aes(color=Subject)
Speg.Plot.3<-ggplot(data = TrustLong.Final,
aes(x = Time, y = TrustFeeling, group=Subject, color=Subject))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())
Speg.Plot.3
- Wait, why is the color continuous?
str(TrustLong.Final)
## 'data.frame': 120 obs. of 5 variables:
## $ Subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Personality : num 0.0223 0.6669 0.7458 0.8527 0.6066 ...
## $ Vision : chr "NoSee" "NoSee" "NoSee" "NoSee" ...
## $ Time : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TrustFeeling: num 0.1301 0.2051 0.0601 0.2101 0.1451 ...
- Because Subject is an integer
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
- Remake the plot and replace Subject with Subject.F
- We can also add shape and linetype by Subject (in case we have to
print in black and white)
Speg.Plot.4<-ggplot(data = TrustLong.Final,
aes(x = Time, y = TrustFeeling, group=Subject.F, color=Subject.F,shape=Subject.F,linetype=Subject.F))+
facet_grid(~Vision)+
geom_point()+
geom_smooth(method='lm', se=FALSE, formula = y ~ stats::poly(x,2))+
xlab("Time Step")+ylab("Trust Score")+
theme_bw()+
theme(panel.grid.minor = element_blank())
Speg.Plot.4
Speg.Plot.4a<-Speg.Plot.4+theme(legend.position = "none")
Speg.Plot.4a
- And you can make it totally “far out”
- You can use HTML (hex) color codes or color names
Speg.Plot.4a + theme(plot.background = element_rect(size = 1, color = "blue", fill = "purple"),
text=element_text(size = 12, family = "serif", color = "ivory"),
axis.text.y = element_text(colour = "magenta"),
axis.text.x = element_text(colour = "green"),
panel.background = element_rect(fill = "pink"),
strip.background = element_rect(fill = "#ccff66"))
Tidyr & ggplot fun time
- Just for fun, let’s run the polynomial regressions and extract the
\(R^2\) per subject for each Vision condition
TrustLong.Final$Subject.F<-as.factor(TrustLong.Final$Subject)
PolyRegress<-TrustLong.Final %>%
  group_by(Vision, Subject.F) %>%
  do(glance(lm(TrustFeeling ~ stats::poly(Time,2), data=.))) %>%
  select(Subject.F, Vision, r.squared)  # name both grouping columns explicitly
Subject.F | Vision | r.squared
1 | NoSee | 0.2266071
2 | NoSee | 0.9536065
3 | NoSee | 0.9938067
4 | NoSee | 0.5280375
5 | NoSee | 0.9600112
6 | NoSee | 0.9997217
- Let’s extract the Personality score and join it to that new
dataset
PersonalityScore<-TrustLong.Final %>%
dplyr::filter(Time==0) %>%
select(Subject.F,Personality, Vision)
MergedData<-left_join(PolyRegress,PersonalityScore, by = c("Subject.F","Vision"))
Subject.F | Vision | r.squared | Personality
1 | NoSee | 0.2266071 | 0.0222995
2 | NoSee | 0.9536065 | 0.6669353
3 | NoSee | 0.9938067 | 0.7457592
4 | NoSee | 0.5280375 | 0.8527015
5 | NoSee | 0.9600112 | 0.6066250
6 | NoSee | 0.9997217 | 2.1633908
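Since Personality is a property of the subject, a one-row-per-subject lookup table joined on the subject key alone would also work. The sketch below shows that variant on hypothetical toy data, with the join key spelled out through by so that dplyr does not have to guess it. The data frame names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical per-condition results: 2 subjects x 2 vision conditions
toy_fit <- data.frame(Subject.F = factor(c(1, 2, 1, 2)),
                      Vision = c("NoSee", "NoSee", "See", "See"),
                      r.squared = c(0.23, 0.95, 0.80, 0.60),
                      stringsAsFactors = FALSE)

# Hypothetical lookup table: one Personality score per subject
toy_person <- data.frame(Subject.F = factor(c(1, 2)),
                         Personality = c(0.02, 0.67))

# Every row of toy_fit picks up its subject's Personality score
toy_merged <- left_join(toy_fit, toy_person, by = "Subject.F")
toy_merged
```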
- Scatter plot by Vision with fancy labels
Scatter.plot<-ggplot(data = MergedData, aes(Personality,r.squared))+
geom_point(aes(shape=Vision))+
xlab("Personality Score")+
ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
theme_minimal()
Scatter.plot
- Reorder and rename labels
MergedData$Vision.O <- factor(MergedData$Vision,
levels = c("See","NoSee"),
labels = c("Partner is Visible","Partner is obscured"))
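In factor(), levels sets the order of the categories and labels renames them position by position. The sketch below shows this on a short hypothetical vector, and the cross-tabulation is a quick way to check that each old value landed on the intended new label.

```r
# Hypothetical vector of condition codes
vision <- c("NoSee", "See", "See", "NoSee")

# levels = desired order of the existing values; labels = new names, same order
vision_o <- factor(vision,
                   levels = c("See", "NoSee"),
                   labels = c("Partner is visible", "Partner is obscured"))

levels(vision_o)         # the new labels, in the order given above
table(vision, vision_o)  # old values against new labels
```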
- Add new labels and fix up legend
Scatter.plot2<-ggplot(data = MergedData, aes(Personality,r.squared))+
geom_point(aes(shape=Vision.O),size = 2.5, stroke = 1.25)+
xlab("Personality Score")+
ylab(expression(paste("Polynomial Regression, ", R^{2},' Result Per Subject')))+
theme_minimal()+theme(legend.position = "top",
legend.text = element_text(size = 11, color = "gray50"),
legend.title=element_blank())
Scatter.plot2
Scatter.plot3<-Scatter.plot2+scale_shape_manual(values=c(21,24))
Scatter.plot3
- Let’s add labels so we know which dots are which subjects
library("ggrepel")
Scatter.plot3 +
  geom_text_repel(aes(label=Subject.F), size = 3)
- Label only a handful of subjects
pointsToLabel <- c("1","6","12")
Scatter.plot4<-Scatter.plot3 +
geom_text_repel(aes(label = Subject.F),
color = "gray20",
data = subset(MergedData, Subject.F %in% pointsToLabel),
force = 10)
Scatter.plot4
Scatter.plot5<- Scatter.plot4+scale_x_continuous(name = "Personality Score, Judging (1=Most)",
limits = c(0.0, 2.5),
breaks = seq(0.0, 2.5, by = 0.25))
Scatter.plot5
- Add a more complex regression to the plot
Scatter.plot6 <-Scatter.plot5 +
geom_smooth(aes(linetype = Vision.O, group=Vision.O),
method = "lm",
formula = y ~ log(x), se = FALSE,
color = "green")
Scatter.plot6
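The fitted lines can also be computed outside the plot, which makes it easier to see what was fit. As a hedged sketch on hypothetical simulated data, the block below fits y ~ log(x) separately per group with base R. Note that log(x) requires x > 0, so a score of exactly 0 would have to be handled before fitting. The group names A and B and all values are made up.

```r
# Hypothetical simulated data: 12 subjects in each of two groups
set.seed(7)
toy <- data.frame(Personality = runif(24, min = 0.05, max = 2.5),
                  Vision.O = rep(c("A", "B"), each = 12))
toy$r.squared <- 0.8 + 0.1 * log(toy$Personality) + rnorm(24, sd = 0.05)
toy$r.squared <- pmin(1, pmax(0, toy$r.squared))  # keep values in [0, 1]

# One log-linear fit per group: intercept and slope on log(Personality)
fits <- lapply(split(toy, toy$Vision.O), function(d) {
  coef(lm(r.squared ~ log(Personality), data = d))
})
fits
```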
