1 Missingness

1.1 Why is it missing?

  • MCAR (Missing Completely At Random): “There’s no relationship between whether a data point is missing and any values in the data set, missing or observed.”

  • MAR (Missing At Random): “means the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data”. In other words, whether a response is missing depends on another answer that you did observe. You ask me if I like spiders and I say no. Next you ask me if I have a pet spider, and I don't respond.

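  • MNAR (Missing Not At Random): “means the propensity for a data point to be missing is related to the missing data”. In other words, whether a value is missing depends on the value itself (e.g., people with very high incomes skip the income question), so the observed data alone cannot tell you why it is missing or what it would have been.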

MCAR is the best case; MNAR is the worst.
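The three mechanisms can be made concrete with a small base R simulation. This is only a sketch: the variable names, the 30% rate, and the logistic missingness models are illustrative assumptions, not part of the examples that follow.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)                 # fully observed covariate
y <- 0.5 * x + rnorm(n)       # variable that will receive missing values

# MCAR: every y has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.30

# MAR: the chance that y is missing depends only on the observed x
miss_mar <- runif(n) < plogis(-1 + 1.5 * x)

# MNAR: the chance that y is missing depends on y itself
miss_mnar <- runif(n) < plogis(-1 + 1.5 * y)

# Mean of the observed y under each mechanism (the full-data mean is near 0)
obs_means <- c(full = mean(y),
               MCAR = mean(y[!miss_mcar]),
               MAR  = mean(y[!miss_mar]),
               MNAR = mean(y[!miss_mnar]))
round(obs_means, 2)
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR and MNAR it drifts, but only in the MAR case is the variable that explains the drift (x) available to correct for it.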

1.2 How big is my data set?

  • Normal to huge data set: Losing 5-10% of your data is usually not a problem, so move on with life. If you lose more than that, you have to start thinking about missing-data analysis and replacement.
  • Very small data set (5-10 people per IV): Tea leaf reading or voodoo.
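Before deciding anything, it helps to count how much is actually missing. A minimal sketch in base R, using a made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical toy data frame with a few NAs
toy <- data.frame(DV  = c(1.2, NA, 3.1, 4.8, 5.0, NA, 7.7, 8.1, 9.4, 10.2),
                  IV1 = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                  IV2 = c(5, 3, 8, 1, 9, 2, 7, 4, 6, 10))

# Proportion of missing values per variable
prop_missing <- colMeans(is.na(toy))

# Proportion of subjects that would survive listwise deletion
prop_complete <- mean(complete.cases(toy))

prop_missing
prop_complete
```

The per-variable proportions can look harmless while the proportion of complete cases is much lower, because the missing values are spread across different subjects.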

2 Common methods to Impute Missing data

  • Impute: “assign (a value) to something by inference from the value of the products or processes to which it contributes”.

2.1 Classical Methods (assuming MCAR)

  • Listwise deletion: Removes the whole subject if one data point is missing. Easy to implement, but loses a lot of data.
  • Pairwise deletion: Removes a subject only from the calculations that involve the missing variable (keeps most of the data). Can be hard to implement (it can mess up the error term or make the DF differ from model to model), but less data is lost.
  • Mean imputation: Fills in the missing data points with the mean of that variable. Keeps every subject and the full DF, but lowers the overall variance and so distorts the error term. This can create problems in very small samples.
  • Conditional mean imputation: Replaces missing values with the predicted scores from a linear regression equation. [Very complicated when MAR is an issue]

  • The most common approaches in psychology tend to be mean imputation or pairwise or listwise deletion (SPSS defaults vary by analysis; R defaults to listwise deletion in most cases)

  • For a review of the theory see: Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research, 74, 525-556.
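The four classical methods can be sketched in base R on a small simulated data set. The data frame `d` and its parameters are assumptions made for illustration only; the worked examples in this section use the mice package instead.

```r
set.seed(2)
n <- 200
d <- data.frame(IV1 = rnorm(n))
d$DV <- 1 + 0.6 * d$IV1 + rnorm(n)
d$DV[sample(n, 40)] <- NA          # knock out 40 DV values completely at random

# Listwise deletion: drop every row that has any NA
d_listwise <- na.omit(d)

# Pairwise deletion: each statistic uses all cases available for that pair
r_pairwise <- cor(d, use = "pairwise.complete.obs")

# Mean imputation: replace each NA with the mean of the observed values
d_mean <- d
d_mean$DV[is.na(d_mean$DV)] <- mean(d$DV, na.rm = TRUE)

# Conditional mean imputation: replace each NA with a regression prediction
fit <- lm(DV ~ IV1, data = d)      # lm() drops incomplete rows by default
d_cond <- d
is_missing <- is.na(d$DV)
d_cond$DV[is_missing] <- predict(fit, newdata = d[is_missing, ])

# Compare the variance of DV after each treatment
c(listwise = var(d_listwise$DV),
  mean     = var(d_mean$DV),
  cond     = var(d_cond$DV))
```

Mean imputation always gives the smallest variance of the three, because every imputed value sits exactly at the observed mean.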

2.1.1 Classical Methods

  • We will do the listwise and mean replacement types (conditional means can be done with the mice package as well)
  • Simulate a simple regression - remove cases completely at random and test

2.1.1.1 Typical Sample Size

library(car)
set.seed(42)
n <- 50
# IVs
X <- runif(n, -10, 10)
Z <- rnorm(n, -10, 10)
# Our equation to  create Y
Y <- .8*X - .4*Z + 2 + rnorm(n, sd=10)
#Built our data frame
DataSet1<-data.frame(DV=Y,IV1=X,IV2=Z)
  • Remove IVs and DVs at random (ampute) at about 50%
library(mice)
set.seed(666)
Amputed.Data<-ampute(data =DataSet1, prop = 0.5, mech = "MCAR")
DataSet1.M<-Amputed.Data$amp
  • Let's visualize the missingness
library(VIM)
aggr(DataSet1.M, col=c('navyblue','red'), numbers=TRUE,
     sortVars=TRUE, labels=names(DataSet1.M), cex.axis=.7,
     gap=3,
     ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##       IV1  0.24
##       IV2  0.20
##        DV  0.14
  • Let's visualize pairwise scatterplots, with the margins showing where the missing data fall
marginplot(DataSet1.M[c(1,2)])

marginplot(DataSet1.M[c(2,3)])

marginplot(DataSet1.M[c(1,3)])

  • Run the regression of the original and missing data side by side
  • R's lm() defaults to listwise deletion (the most conservative approach)
  • Pairwise-deletion regression is not built into R by default (but pairwise deletion is available for other methods, such as correlations)
  • We compare the original, listwise, and mean-imputed (using the mice package) models
Orginal<-lm(DV~IV1+IV2, data= DataSet1)
Listwise<-lm(DV~IV1+IV2, data= DataSet1.M)

# Mean impute using MICE 
DataSet1.MM<- complete(mice(DataSet1.M, meth='mean',printFlag = FALSE))
MeanImpute<-lm(DV~IV1+IV2, data= DataSet1.MM)

library(stargazer)
stargazer(Orginal,Listwise,MeanImpute,type="html",
          column.labels = c("Original", "Listwise","Mean Impute"),
          intercept.bottom = FALSE,
          single.row=FALSE,
          notes.append = FALSE,
          header=FALSE)

Dependent variable: DV
                      Original                  Listwise                  Mean Impute
                      (1)                       (2)                       (3)
Constant              2.612                     5.068                     3.370*
                      (1.909)                   (3.138)                   (1.841)
IV1                   0.133                     0.030                     0.021
                      (0.209)                   (0.333)                   (0.225)
IV2                   -0.428***                 -0.267                    -0.416***
                      (0.131)                   (0.197)                   (0.132)
Observations          50                        21                        50
R2                    0.192                     0.098                     0.176
Adjusted R2           0.158                     -0.002                    0.140
Residual Std. Error   8.876 (df = 47)           9.336 (df = 18)           8.223 (df = 47)
F Statistic           5.590*** (df = 2; 47)     0.980 (df = 2; 18)        5.004** (df = 2; 47)
Note: *p<0.1; **p<0.05; ***p<0.01

2.1.1.2 Huge Data Set

  • N = 500
set.seed(42)
n <- 500
# IVs
X <- runif(n, -10, 10)
Z <- rnorm(n, -10, 10)

# Our equation to  create Y
Y <- .8*X - .4*Z + 2 + rnorm(n, sd=10)

#Built our data frame
DataSet2<-data.frame(DV=Y,IV1=X,IV2=Z)

#remove cases at random 
set.seed(666)
Amputed.Data.2<-ampute(data =DataSet2, prop = 0.5, mech = "MCAR")
DataSet2.M<-Amputed.Data.2$amp
Orginal.2<-lm(DV~IV1+IV2, data= DataSet2)
Listwise.2<-lm(DV~IV1+IV2, data= DataSet2.M)

# Mean impute using MICE 
DataSet2.MM<- complete(mice(DataSet2.M, meth='mean',printFlag = FALSE))
MeanImpute.2<-lm(DV~IV1+IV2, data= DataSet2.MM)

stargazer(Orginal.2,Listwise.2,MeanImpute.2,type="html",
          column.labels = c("Original", "Listwise","Mean Impute"),
          intercept.bottom = FALSE,
          single.row=FALSE,
          notes.append = FALSE,
          header=FALSE)

Dependent variable: DV
                      Original                  Listwise                  Mean Impute
                      (1)                       (2)                       (3)
Constant              2.249***                  1.769*                    3.066***
                      (0.680)                   (0.969)                   (0.692)
IV1                   0.817***                  0.839***                  0.700***
                      (0.079)                   (0.111)                   (0.084)
IV2                   -0.378***                 -0.392***                 -0.285***
                      (0.047)                   (0.068)                   (0.050)
Observations          500                       263                       500
R2                    0.258                     0.250                     0.168
Adjusted R2           0.255                     0.244                     0.165
Residual Std. Error   10.368 (df = 497)         10.390 (df = 260)         10.066 (df = 497)
F Statistic           86.263*** (df = 2; 497)   43.386*** (df = 2; 260)   50.286*** (df = 2; 497)
Note: *p<0.1; **p<0.05; ***p<0.01

2.1.1.3 Tiny Sample Size

  • N = 15
set.seed(42)
n <- 15
# IVs
X <- runif(n, -10, 10)
Z <- rnorm(n, -10, 10)

# Our equation to  create Y
Y <- .8*X - .4*Z + 2 + rnorm(n, sd=10)

#Built our data frame
DataSet3<-data.frame(DV=Y,IV1=X,IV2=Z)

#remove cases at random 
set.seed(666)
Amputed.Data.3<-ampute(data =DataSet3, prop = 0.5, mech = "MCAR")
DataSet3.M<-Amputed.Data.3$amp
Orginal.3<-lm(DV~IV1+IV2, data= DataSet3)
Listwise.3<-lm(DV~IV1+IV2, data= DataSet3.M)

# Mean impute using MICE 
DataSet3.MM<- complete(mice(DataSet3.M, meth='mean',printFlag = FALSE))
MeanImpute.3<-lm(DV~IV1+IV2, data= DataSet3.MM)

stargazer(Orginal.3,Listwise.3,MeanImpute.3,type="html",
          column.labels = c("Original", "Listwise","Mean Impute"),
          intercept.bottom = FALSE,
          single.row=FALSE,
          notes.append = FALSE,
          header=FALSE)

Dependent variable: DV
                      Original                  Listwise                  Mean Impute
                      (1)                       (2)                       (3)
Constant              3.822                     1.437                     0.169
                      (3.024)                   (8.258)                   (3.093)
IV1                   1.310**                   0.892                     1.157*
                      (0.526)                   (1.249)                   (0.548)
IV2                   -0.225                    -0.163                    -0.703**
                      (0.272)                   (1.212)                   (0.267)
Observations          15                        7                         15
R2                    0.405                     0.121                     0.429
Adjusted R2           0.306                     -0.319                    0.334
Residual Std. Error   9.768 (df = 12)           13.827 (df = 4)           8.786 (df = 12)
F Statistic           4.081** (df = 2; 12)      0.274 (df = 2; 4)         4.506** (df = 2; 12)
Note: *p<0.1; **p<0.05; ***p<0.01

2.1.1.4 Typical Samples with MNAR

  • N = 50
#remove cases not at random (MNAR) from our original data set
set.seed(666)
Amputed.Data.4<-ampute(data =DataSet1, prop = 0.5, mech = "MNAR")
DataSet4.M<-Amputed.Data.4$amp
Orginal.4<-lm(DV~IV1+IV2, data= DataSet1)
Listwise.4<-lm(DV~IV1+IV2, data= DataSet4.M)

# Mean impute using MICE 
DataSet4.MM<- complete(mice(DataSet4.M, meth='mean',printFlag = FALSE))
MeanImpute.4<-lm(DV~IV1+IV2, data= DataSet4.MM)

stargazer(Orginal.4,Listwise.4,MeanImpute.4,type="html",
          column.labels = c("Original", "Listwise","Mean Impute"),
          intercept.bottom = FALSE,
          single.row=FALSE,
          notes.append = FALSE,
          header=FALSE)

Dependent variable: DV
                      Original                  Listwise                  Mean Impute
                      (1)                       (2)                       (3)
Constant              2.612                     1.071                     2.196
                      (1.909)                   (2.621)                   (1.714)
IV1                   0.133                     -0.185                    -0.008
                      (0.209)                   (0.327)                   (0.211)
IV2                   -0.428***                 -0.412**                  -0.390***
                      (0.131)                   (0.167)                   (0.118)
Observations          50                        24                        50
R2                    0.192                     0.232                     0.189
Adjusted R2           0.158                     0.159                     0.155
Residual Std. Error   8.876 (df = 47)           9.254 (df = 21)           7.816 (df = 47)
F Statistic           5.590*** (df = 2; 47)     3.172* (df = 2; 21)       5.477*** (df = 2; 47)
Note: *p<0.1; **p<0.05; ***p<0.01

2.1.1.5 Typical Samples with MAR

  • N = 50
#remove cases (MAR) from our original data set
set.seed(666)
Amputed.Data.5<-ampute(data =DataSet1, prop = 0.5, mech = "MAR")
DataSet5.M<-Amputed.Data.5$amp
Orginal.5<-lm(DV~IV1+IV2, data= DataSet1)
Listwise.5<-lm(DV~IV1+IV2, data= DataSet5.M)

# Mean impute using MICE 
DataSet5.MM<- complete(mice(DataSet5.M, meth='mean',printFlag = FALSE))
MeanImpute.5<-lm(DV~IV1+IV2, data= DataSet5.MM)

stargazer(Orginal.5,Listwise.5,MeanImpute.5,type="html",
          column.labels = c("Original", "Listwise","Mean Impute"),
          intercept.bottom = FALSE,
          single.row=FALSE,
          notes.append = FALSE,
          header=FALSE)

Dependent variable: DV
                      Original                  Listwise                  Mean Impute
                      (1)                       (2)                       (3)
Constant              2.612                     -1.810                    2.752
                      (1.909)                   (2.196)                   (2.042)
IV1                   0.133                     0.032                     0.102
                      (0.209)                   (0.244)                   (0.236)
IV2                   -0.428***                 -0.491***                 -0.387**
                      (0.131)                   (0.141)                   (0.148)
Observations          50                        27                        50
R2                    0.192                     0.339                     0.128
Adjusted R2           0.158                     0.284                     0.090
Residual Std. Error   8.876 (df = 47)           7.364 (df = 24)           8.586 (df = 47)
F Statistic           5.590*** (df = 2; 47)     6.158*** (df = 2; 24)     3.437** (df = 2; 47)
Note: *p<0.1; **p<0.05; ***p<0.01

2.1.2 Notes

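  • Mean imputing works OK in medium to large sample sizes, although it still shrinks the slopes and R2 (in the N = 500 example, IV1 drops from 0.817 to 0.700 and R2 from 0.258 to 0.168)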
  • But it creates large bias in small samples
  • If you don’t have MCAR you have to turn to more complex methods
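Where the distortion from mean imputation comes from can be seen in a simple regression with one predictor. The sketch below is an assumption-laden illustration (simulated data, 30% of the DV removed under MCAR, names invented here): mean-imputed DV values carry no information about the IV, so they pull the slope toward zero by roughly the fraction missing.

```r
set.seed(3)
n <- 5000
IV <- rnorm(n)
DV <- 2 + 0.8 * IV + rnorm(n)

# Remove 30% of the DV values completely at random (MCAR)
n_missing <- 1500
DV_miss <- DV
DV_miss[sample(n, n_missing)] <- NA

# Listwise deletion: lm() drops the incomplete rows
slope_listwise <- coef(lm(DV_miss ~ IV))[["IV"]]

# Mean imputation of the DV
DV_mean <- ifelse(is.na(DV_miss), mean(DV_miss, na.rm = TRUE), DV_miss)
slope_mean <- coef(lm(DV_mean ~ IV))[["IV"]]

c(listwise = slope_listwise, mean_impute = slope_mean)
```

With a true slope of 0.8 and 30% missing, the mean-imputed slope should land near 0.8 * 0.7 = 0.56, while listwise deletion stays near 0.8 at the cost of a smaller sample.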

2.2 Modern approaches

  • There are many modern approaches often based on maximum likelihood estimation (ML) or Multiple Imputation (MI) or even Bayesian approaches
  • “ML estimation is to identify the population parameter values most likely to have produced a particular sample of data. This usually requires an iterative process whereby the model fitting program ‘tries out’ different values for the parameters of interest (e.g., regression coefficients) en route to identifying the values most likely to have produced the sample data.” (Peugh & Enders, 2004)
  • Multiple Imputation: “Rather than treating a single set of imputed values as ‘true’ estimates of the missing values, MI creates a number of imputed data sets (frequently between 5 and 10), each of which contains a different plausible estimate of the missing values.” (Peugh & Enders, 2004)
  • Let's test out MI, as it is flexible, widely available, and often treated as an alternative to ML
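After the m data sets are analyzed, MI combines the m estimates with Rubin's rules: the pooled estimate is the average, and the pooled variance adds the between-imputation spread to the average within-imputation variance. A minimal base R sketch follows; the function name `pool_rubin` and the input numbers are hypothetical.

```r
# Hypothetical results from m = 5 imputed data sets:
# one slope estimate and its squared standard error per data set
est <- c(0.41, 0.38, 0.45, 0.40, 0.36)
se2 <- c(0.020, 0.022, 0.019, 0.021, 0.020)

pool_rubin <- function(est, se2) {
  m     <- length(est)
  q_bar <- mean(est)                   # pooled point estimate
  u_bar <- mean(se2)                   # within-imputation variance
  b     <- var(est)                    # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b     # total variance
  c(estimate = q_bar, se = sqrt(total))
}

pooled <- pool_rubin(est, se2)
pooled
```

The pooled standard error is always at least as large as the one you would get by averaging the within-imputation variances alone, which is how MI carries the uncertainty about the missing values into the final test.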

2.3 Multiple Imputation by Chained Equations

  • See Azur et al. (2011) for details
  • MICE package in R: Lots of old and modern algorithms to choose from (also for design beyond linear regression)
  • Predictive mean matching is a good version to use as it “is a general purpose semi-parametric imputation method. … imputations are restricted to the observed values and that it can preserve non-linear relations…” (Buuren & Groothuis-Oudshoorn, 2011)
  • We will work from our MAR data set, DataSet5.M
  • We can compare this to our classical methods and some other methods which the package allows (bootstrapping & Bayesian)

2.3.1 Typical sample size

# Note: complete() returns only the first of the m imputed data sets by default (action = 1),
# so each model below is fit to a single imputed data set rather than pooled across all m.

#Linear regression using bootstrap
DataDataSet5.Boot<- complete(mice(DataSet5.M, m = 10,meth='norm.boot',printFlag = FALSE, seed = 666))
Boot<-lm(DV~IV1+IV2, data= DataDataSet5.Boot)

#Bayesian linear regression
DataDataSet5.BLR<- complete(mice(DataSet5.M, m = 10,meth='norm',printFlag = FALSE, seed = 666))
BLR<-lm(DV~IV1+IV2, data= DataDataSet5.BLR)

#Predictive mean matching
DataDataSet5.pmm<- complete(mice(DataSet5.M, m = 10,meth='pmm',printFlag = FALSE, seed = 666))
PMM<-lm(DV~IV1+IV2, data= DataDataSet5.pmm)

stargazer(Orginal.5, Boot, BLR, PMM,type="html",
          column.labels = c("Original","Boot","BLR","PMM"),
          intercept.bottom = FALSE,
          single.row=FALSE,
          notes.append = FALSE,
          header=FALSE)

Dependent variable: DV
                                Original    Boot        BLR         PMM
                                (1)         (2)         (3)         (4)
Constant                        2.612       2.198       2.146       2.467
                                (1.909)     (1.938)     (2.088)     (2.099)
IV1                             0.133       0.356       0.133       0.187
                                (0.209)     (0.223)     (0.224)     (0.239)
IV2                             -0.428***   -0.440***   -0.361***   -0.392***
                                (0.131)     (0.134)     (0.134)     (0.143)
Observations                    50          50          50          50
R2                              0.192       0.203       0.136       0.140
Adjusted R2                     0.158       0.169       0.099       0.104
Residual Std. Error (df = 47)   8.876       8.544       9.088       9.041
F Statistic (df = 2; 47)        5.590***    5.970***    3.702**     3.839**
Note: *p<0.1; **p<0.05; ***p<0.01
  • Predictive mean matching is the default method in mice for numeric data, and you can see it does a good job here
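Because complete() hands back a single imputed data set by default (action = 1), a full MI analysis usually fits the model in every imputed data set and pools the results. The sketch below uses mice's with() and pool() on a small stand-in data set; `demo` and its missingness are assumptions made so the block runs on its own, and it is not the DataSet5.M analysis above.

```r
library(mice)

# Small stand-in data set (an assumption; not the DataSet5.M used above)
set.seed(42)
n <- 50
demo <- data.frame(IV1 = runif(n, -10, 10), IV2 = rnorm(n, -10, 10))
demo$DV <- 0.8 * demo$IV1 - 0.4 * demo$IV2 + 2 + rnorm(n, sd = 10)
demo$DV[sample(n, 10)]  <- NA
demo$IV1[sample(n, 10)] <- NA

# Create m = 10 imputed data sets with predictive mean matching
imp <- mice(demo, m = 10, method = "pmm", printFlag = FALSE, seed = 666)

# Fit the same regression in every imputed data set, then pool the results
fits   <- with(imp, lm(DV ~ IV1 + IV2))
pooled <- pool(fits)
pooled_summary <- summary(pooled)
pooled_summary
```

The pooled standard errors reflect both the sampling error within each data set and the disagreement between data sets, which a single completed data set cannot show.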

2.3.2 Small sample size

  • Use the same exact small data (N = 15)
#Linear regression using bootstrap
DataDataSet3.Boot<- complete(mice(DataSet3.M, m = 15,meth='norm.boot',printFlag = FALSE, seed = 666))
Boot.3<-lm(DV~IV1+IV2, data= DataDataSet3.Boot)

#Bayesian linear regression
DataDataSet3.BLR<- complete(mice(DataSet3.M, m = 15,meth='norm',printFlag = FALSE, seed = 666))
BLR.3<-lm(DV~IV1+IV2, data= DataDataSet3.BLR)

#Predictive mean matching 
DataDataSet3.pmm<- complete(mice(DataSet3.M, m = 15,meth='pmm',printFlag = FALSE, seed = 666))
PMM.3<-lm(DV~IV1+IV2, data= DataDataSet3.pmm)

stargazer(Orginal.3, Boot.3, BLR.3, PMM.3,type="html",
          column.labels = c("Original","Boot","BLR","PMM"),
          intercept.bottom = FALSE,
          single.row=FALSE,
          notes.append = FALSE,
          header=FALSE)

Dependent variable: DV
                                Original    Boot        BLR         PMM
                                (1)         (2)         (3)         (4)
Constant                        3.822       -2.926      1.351       -1.584
                                (3.024)     (4.232)     (3.624)     (3.647)
IV1                             1.310**     1.352*      1.299**     0.971
                                (0.526)     (0.700)     (0.541)     (0.581)
IV2                             -0.225      -1.026**    -0.467      -0.991**
                                (0.272)     (0.383)     (0.274)     (0.365)
Observations                    15          15          15          15
R2                              0.405       0.375       0.378       0.382
Adjusted R2                     0.306       0.270       0.274       0.278
Residual Std. Error (df = 12)   9.768       9.319       9.680       9.394
F Statistic (df = 2; 12)        4.081**     3.595*      3.646*      3.701*
Note: *p<0.1; **p<0.05; ***p<0.01
  • We need to find a practitioner of the dark arts, as none of the methods did a good job

2.3.2.1 Notes

  • You have to choose the number of imputed data sets (m) and the number of iterations per data set (maxit) yourself; mice defaults to 5 for each
  • You will have to research each algorithm for your particular type of missingness

3 Best Practices

  • Each situation is different and which method you use depends on many factors
  • Also you must think carefully about the assumptions you are making
  • For example, if you mean replace (or use the normal-model methods such as norm) you are assuming the data are normal; PMM relaxes this because it only imputes observed values, but it still relies on a linear model to pick its donors
  • There are other advanced methods for specific situations: “hot decking” for survey research; nearest neighbor, splines, or autoregressive methods for time-series or longitudinal data
  • Best practice: compare several logical replacement methods and also compare them to listwise removal. If the story you are trying to tell is very different depending on the method, and very different from the listwise result, it suggests you have not factored in some assumption correctly, and it is best to rethink what you are doing.
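A comparison of this kind can be sketched in base R by fitting the same model after each treatment and lining up the coefficients. Everything here is illustrative: the simulated data, the helper names `impute_mean` and `impute_hotdeck`, and the hot deck itself, which is a plain random draw from the observed values rather than the adjustment-cell versions used in survey work.

```r
set.seed(4)
n <- 300
d <- data.frame(IV1 = rnorm(n), IV2 = rnorm(n))
d$DV <- 2 + 0.8 * d$IV1 - 0.4 * d$IV2 + rnorm(n)
d$DV[sample(n, 60)] <- NA            # 20% of DV missing completely at random

# Helper (hypothetical): replace NAs in a numeric vector with its observed mean
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Helper (hypothetical): simple random hot deck, fill NAs by sampling observed values
impute_hotdeck <- function(v) {
  v[is.na(v)] <- sample(v[!is.na(v)], sum(is.na(v)), replace = TRUE)
  v
}

datasets <- list(
  listwise = na.omit(d),
  mean     = transform(d, DV = impute_mean(DV)),
  hotdeck  = transform(d, DV = impute_hotdeck(DV))
)

# One row of coefficients per method; large disagreements are a warning sign
comparison <- t(sapply(datasets, function(dat) coef(lm(DV ~ IV1 + IV2, data = dat))))
round(comparison, 2)
```

If the rows tell different stories, that is the cue to revisit the assumed missingness mechanism before trusting any single method.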
