# 1 Correlations

• Everything correlates with everything (Ambient Correlational Noise).
• Our goal today is to determine how much.
• Today we deal with 2 variables; next we explore the problems with 3 or more.

Correlations between two variables

• Pearson’s (interval by interval)
• Spearman’s (interval by ordinal)
• Point-biserial (interval by dichotomous)
• Polychoric (ordinal by ordinal)
• Tetrachoric (dichotomous by dichotomous)

## 1.1 Pearson’s Correlation

$$r_{xy}=\frac{\sum{xy}}{\sqrt{\sum{x^2}\sum{y^2}}}$$

• Numerator = How much they vary together (co-variance)
• Denominator = How much they vary alone (variance)
• Values are bounded between -1 and 1
• Let’s create two random normal variables that correlate with each other
• To do this we need to make a covariance matrix with a correlation of .6
#packages we will need to create and graph our data
library(MASS) #create data
library(car) #graph data
r=.6 #Correlation value
Means.XY<- c(10,10) #set the means of the X and Y variables
CovMatrix.XY <- matrix(c(1,r,r,1),2,2) # creates the covariance matrix

CovMatrix.XY #nice and simple 2x2 matrix
##      [,1] [,2]
## [1,]  1.0  0.6
## [2,]  0.6  1.0
#build the correlated variables.
# Note: empirical=TRUE means make the correlation EXACTLY r.
# With empirical=FALSE, the sample correlation would vary around r
CorrData<-mvrnorm(n=100, mu=Means.XY,Sigma=CovMatrix.XY, empirical=TRUE)

#Convert them to a "data.frame", which is like the SPSS data window
CorrData<-as.data.frame(CorrData)
#let's add labels to the vectors we created
colnames(CorrData) <- c("Happiness","IceCream")
#Let's view the first few subjects
head(CorrData)
##   Happiness  IceCream
## 1 10.774760 10.907732
## 2  8.700742  8.816727
## 3  9.722170  9.617680
## 4 10.927843 10.499182
## 5 11.584289 10.692111
## 6  9.641194  9.557827
#make the scatter plot
scatterplot(Happiness~IceCream,CorrData, smoother=FALSE)

#calculate the Pearson's correlation
cor.test(CorrData$Happiness, CorrData$IceCream,
method = c("pearson"))
##
##  Pearson's product-moment correlation
##
## data:  CorrData$Happiness and CorrData$IceCream
## t = 7.4246, df = 98, p-value = 4.193e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4574985 0.7124547
## sample estimates:
## cor
## 0.6
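We can verify this result by computing r by hand from the deviation-score formula above, using the simulated CorrData:

```r
# Hand-compute Pearson's r: co-variation over variation, in deviation scores
x <- CorrData$Happiness - mean(CorrData$Happiness)
y <- CorrData$IceCream - mean(CorrData$IceCream)
sum(x * y) / sqrt(sum(x^2) * sum(y^2))
## [1] 0.6
```

Because we set empirical=TRUE, the hand computation matches cor.test exactly.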

### 1.1.1 Pearson’s correlation is scale independent!

• No matter the mean differences or range of scores, Pearson’s r will give the same result.
• We can also z-score the data and get the same result
• However, if the variables are scaled non-linearly (sqrt, ^2, log, …), the correlation will change.
#let's add a constant (change the mean)
Happiness.big<-CorrData$Happiness+1000
IceCream<-CorrData$IceCream
cor.test(Happiness.big, IceCream, method = c("pearson"))
##
##  Pearson's product-moment correlation
##
## data:  Happiness.big and IceCream
## t = 7.4246, df = 98, p-value = 4.193e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4574985 0.7124547
## sample estimates:
## cor
## 0.6
#z-scored
Happiness.z<-scale(CorrData$Happiness)
IceCream.z<-scale(CorrData$IceCream)
cor.test(Happiness.z, IceCream.z, method = c("pearson"))
##
##  Pearson's product-moment correlation
##
## data:  Happiness.z and IceCream.z
## t = 7.4246, df = 98, p-value = 4.193e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4574985 0.7124547
## sample estimates:
## cor
## 0.6
#what happens if I LINEARLY scale them differently?
cor.test(Happiness.big, IceCream.z, method = c("pearson"))
##
##  Pearson's product-moment correlation
##
## data:  Happiness.big and IceCream.z
## t = 7.4246, df = 98, p-value = 4.193e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4574985 0.7124547
## sample estimates:
## cor
## 0.6
#what happens if I NON-LINEARLY scale them differently?
Happiness<-CorrData$Happiness # original
IceCream.sq4<-(CorrData$IceCream)^4 #Non-linear

scatterplot(Happiness~IceCream.sq4, smoother=FALSE)

cor.test(Happiness, IceCream.sq4, method = c("pearson"))
##
##  Pearson's product-moment correlation
##
## data:  Happiness and IceCream.sq4
## t = 6.6871, df = 98, p-value = 1.409e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4082241 0.6812735
## sample estimates:
##       cor
## 0.5597593

### 1.1.2 Pearson’s: Let’s visualize our result

• Overlap between the two variables is defined by $$r^2$$
library(VennDiagram) # lets us plot our results (like the book)
## Loading required package: grid
## Loading required package: futile.logger
##
## Attaching package: 'VennDiagram'
## The following object is masked from 'package:car':
##
##     ellipse
# calculate r-squared
overlap=r^2

Simple.Corr.Venn<-draw.pairwise.venn(1, 1, overlap, c("Happiness", "IceCream"))
grid.draw(Simple.Corr.Venn)

## 1.2 Spearman’s Correlation

• Spearman’s is a Pearson correlation on rank-ordered data.
• Let’s rank-order our random correlated data
• You have to rank each variable independently first
CorrData$X.rank<-rank(CorrData$Happiness)
CorrData$Y.rank<-rank(CorrData$IceCream)

scatterplot(Y.rank~X.rank,CorrData, smoother=FALSE)

cor.test(CorrData$Y.rank, CorrData$X.rank, method = c("pearson"))
##
##  Pearson's product-moment correlation
##
## data:  CorrData$Y.rank and CorrData$X.rank
## t = 6.7079, df = 98, p-value = 1.278e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4096673 0.6822010
## sample estimates:
##       cor
## 0.5609481
# Or you can just use the built-in Spearman correlation and get the same result
cor.test(CorrData$Y.rank, CorrData$X.rank, method = c("spearman"))
##
##  Spearman's rank correlation rho
##
## data:  CorrData$Y.rank and CorrData$X.rank
## S = 73168, p-value = 1.923e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho
## 0.5609481

## 1.3 Point-Biserial

• This time let’s make up some data on the fly
library(polycor) #Advanced Correlations
##
## Attaching package: 'polycor'
## The following object is masked from 'package:psych':
##
##     polyserial
set.seed(42)
ratings<-c(rnorm(25,mean=5,sd = .5),rnorm(25,mean=2,sd = .5))
Flavors<-c(rep(0,25),rep(1,25))
#flavor labels (the second label is assumed; the output below only shows "cookie Dough")
FlavorNames<-c(rep("cookie Dough",25),rep("Bubble Gum",25))

#Build the data frame
Ice.Cream.Data<-data.frame(
ratings = ratings,
Flavors = Flavors,
Names = FlavorNames)
head(Ice.Cream.Data)
##    ratings Flavors        Names
## 1 5.685479       0 cookie Dough
## 2 4.717651       0 cookie Dough
## 3 5.181564       0 cookie Dough
## 4 5.316431       0 cookie Dough
## 5 5.202134       0 cookie Dough
## 6 4.946938       0 cookie Dough
scatterplot(ratings~Flavors,Ice.Cream.Data, smoother=FALSE)

polyserial(Ice.Cream.Data$Flavors,Ice.Cream.Data$ratings)
## [1] -0.8868748

# 2 Regression

• Correlation and regression are similar
• Correlation determines the standardized relationship between X and Y
• Linear regression = 1 DV and 1 IV, where the relationship is a straight line
• Linear regression determines how X predicts Y
• Multiple (linear) regression = 1 DV and 2+ IV (also straight lines)
• Multiple regression determines how X, Z, etc., predict Y [next week]

## 2.1 Basic Regression Equation

• The linear regression equation you learned in school was probably $$y = MX + b$$
• $$y$$ = predicted value
• $$M$$ = slope
• $$X$$ = variable used to predict $$y$$
• $$b$$ = intercept

## 2.2 Modern Regression Equation

• $$Y=B_{YX}X + B_0 + e$$
• $$Y$$ = predicted value
• $$B_{YX}$$ = slope
• $$B_{0}$$ = intercept
• $$e$$ = error term (observed - predicted). Also called the residual.

## 2.3 Ice cream example

• Specify the model with the lm function.
• We are going to predict happiness scores from ice cream!
scatterplot(Happiness~IceCream, smoother=FALSE)

Happy.Model.1<-lm(Happiness~IceCream,data = CorrData)
summary(Happy.Model.1)
##
## Call:
## lm(formula = Happiness ~ IceCream, data = CorrData)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.54020 -0.62443 -0.01595  0.61440  1.93532
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.00000    0.81211   4.925 3.42e-06 ***
## IceCream     0.60000    0.08081   7.425 4.19e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8041 on 98 degrees of freedom
## Multiple R-squared:   0.36,  Adjusted R-squared:  0.3535
## F-statistic: 55.13 on 1 and 98 DF,  p-value: 4.193e-11

### 2.3.1 Intercept

• 4 is where the line hits the y-axis: the predicted happiness when ice cream = 0.

### 2.3.2 Slope

• 0.6 is the rise over run
• for each 1-unit change in ice cream, there is a corresponding 0.6 change in happiness!
• so we can predict happiness from an ice cream score:

(0.6 * 5 spoons of ice cream + baseline happiness intercept: 4) = 7

• This is your predicted happiness score if you had 5 spoons of ice cream
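We can check this arithmetic with R’s predict() function on the model fit above, passing a (hypothetical) new data point of 5 spoons:

```r
# Predicted happiness for 5 spoons of ice cream: 4 + 0.6 * 5 = 7
predict(Happy.Model.1, newdata = data.frame(IceCream = 5))
```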

### 2.3.3 Error for this prediction?

R will actually do all the prediction for us for each value of ice cream: residuals = observed - predicted

• Red dots = observed above predictor line
• Blue dots = observed below predictor line
• the stronger the color, the more impact that point has in pulling the line in its direction
• Hollow dots = predicted
• The gray lines are the distance between observed and predicted values!

What should the mean of the residuals equal?

CorrData$predicted <- predict(Happy.Model.1) # Save the predicted values with our real data
CorrData$residuals <- residuals(Happy.Model.1) # Save the residual values

library(ggplot2) # Note: a more powerful graphing tool I am forced to use to show this to you
ggplot(data = CorrData, aes(x = IceCream, y = Happiness)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  # Plot regression slope
geom_point(aes(color = residuals)) +  # Color mapped here
scale_color_gradient2(low = "blue", mid = "white", high = "red") +  # Colors to use here
guides(color = FALSE) +
geom_segment(aes(xend = IceCream, yend = predicted), alpha = .2) +  # alpha to fade lines
geom_point(aes(y = predicted), shape = 1) +
theme_bw()  # Add theme for cleaner look
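To answer the question above: the residuals from an OLS fit always sum, and therefore average, to zero. We can confirm this with the residuals we just saved:

```r
# The mean residual is zero (up to floating-point error)
round(mean(CorrData$residuals), 10)
## [1] 0
```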

### 2.3.4 Ordinary least squares (OLS)

• Linear regression finds the best-fit line by minimizing the sum of the squared differences between the observed responses and those predicted by the line.
• OLS is computationally simple for getting the slope value, but can be inaccurate

$$B_{YX}=\frac{\sum{XY}-\frac{1}{n}\sum{X}\sum{Y}}{\sum{X^2}-\frac{1}{n}\left(\sum{X}\right)^2} = \frac{Cov_{XY}}{Var_X}$$

• Modern methods use an alternative (ML, REML) we will examine later when we get to GLM
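The right-hand form of the equation above is easy to check in R; this quick sketch recomputes the slope and intercept from the covariance and variance:

```r
# Slope = covariance over predictor variance; intercept falls out of the means
B.slope <- cov(CorrData$IceCream, CorrData$Happiness) / var(CorrData$IceCream)
B.intercept <- mean(CorrData$Happiness) - B.slope * mean(CorrData$IceCream)
c(slope = B.slope, intercept = B.intercept) # matches lm: 0.6 and 4
```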

## 2.4 SE on the terms in the model (how good is the fit?)

• Residual Standard error = $$\sqrt\frac{\sum{e^2}}{n-2}$$
• in R language:
n=length(CorrData$residuals)
RSE = sqrt(sum(CorrData$residuals^2) / (n-2))
RSE
## [1] 0.8040713
• So our error on the prediction is 0.8040713 happiness points based on our model.

### 2.4.1 SE on the Intercept

• Intercept Standard error = $$RSE\sqrt {\frac{1}{n}+\frac{M_x^2}{(n-1)var_x}}$$
• in R language:
ISE = RSE*sqrt(1 / n + mean(CorrData$IceCream)^2 / ((n - 1) * var(CorrData$IceCream)))
ISE
## [1] 0.8121124

### 2.4.2 SE on the Slope

• Slope Standard error = $$\frac{sd_y}{sd_x}\sqrt{\frac{1 - r_{XY}^2}{n-2}}$$
• in R language:
#let's extract the r-squared from the model
r2.model<-summary(Happy.Model.1)$r.squared
SSE = sd(CorrData$Happiness)/sd(CorrData$IceCream) * sqrt((1- r2.model)/ (n - 2))
SSE
## [1] 0.0808122

### 2.4.3 t-tests on slope and intercept and $$r^2$$ value

• Values are tested against 0, so these are all one-sample t-tests
• slope: $$t = \frac{B_{YX} - H_0}{SE_{B_{YX}}}$$
• intercept: $$t = \frac{B_{0} - H_0}{SE_{B_{0}}}$$

$$r^2$$ is a little different, as it is a correlation value

• correlations are not normally distributed
• Fisher created a conversion for r to make it a z (called Fisher’s $$r$$ to $$Z$$)
• $$r^2$$: $$t = \frac{r_{XY}\sqrt{n-2}}{\sqrt{1-r_{XY}^2}}$$, where df = n - 2
• it is often given as an F value; remember $$F = t^2$$
#intercept
t.I= Happy.Model.1$coefficients[1]/ISE
t.I
## (Intercept)
##    4.925427
#Slope
t.S= Happy.Model.1$coefficients[2]/SSE
t.S
## IceCream
## 7.424621
# For r-squared
t.r2xy = r2.model^.5*sqrt(n-2)/sqrt(1-r2.model)
F.r2xy = t.r2xy^2
F.r2xy
## [1] 55.125
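As an aside, the Fisher r-to-Z conversion mentioned above is just atanh(). A quick sketch of the approximate CI it yields, which agrees with the cor.test output earlier:

```r
# Fisher's r to Z: z = atanh(r), with standard error 1/sqrt(n - 3)
z <- atanh(r)
z.se <- 1 / sqrt(n - 3)
# back-transform an approximate 95% CI to the r scale
tanh(z + c(-1, 1) * qnorm(.975) * z.se)
## [1] 0.4574985 0.7124547
```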

Note: we are testing the slope against a null hypothesis of 0, i.e., the “nil” hypothesis. But 0 is a terrible guess: everything correlates with everything, so it’s important to keep this in mind moving forward. The nil hypothesis can be tested better with bootstrapping.

# 3 Power and Regression

• We studied power last semester
• I will skip ahead to calculating it in R.
• for regression, we will need to convert our $$r^2$$ into Cohen’s $$f^2$$
• $$f^2 = \frac{r^2}{1-r^2}$$
library(pwr) #power analysis
#power for GLM
# u  = degrees of freedom for numerator
# v = degrees of freedom for denominator
# f2 = effect size
# sig.level= (Type I error probability)
# power = (1 minus Type II error probability)
f2.icecream <- r^2 / (1-r^2)

pwr.f2.test(u = 1, v = n-2, f2 = f2.icecream, sig.level = 0.05, power = NULL)
##
##      Multiple regression power calculation
##
##               u = 1
##               v = 98
##              f2 = 0.5625
##       sig.level = 0.05
##           power = 1

So we had power of basically 1 given this sample size and our true correlation of 0.6

# 4 Final Notes

• Testing differences between correlations using Fisher’s test is old-fashioned (Cohen et al., p. 49). The modern approach is the bootstrap, as the old method is under-powered.
• Your book is a little out of date: CIs are better, but bootstrapped CIs are becoming the standard.
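As a minimal sketch of that bootstrap approach (using the boot package, assuming it is installed), one can resample the data and get a percentile CI on the correlation:

```r
library(boot) # bootstrap resampling

# statistic: the correlation recomputed on each resampled data set
boot.r <- boot(CorrData,
               function(d, i) cor(d$Happiness[i], d$IceCream[i]),
               R = 2000)
boot.ci(boot.r, type = "perc") # percentile bootstrap CI on r
```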