I have a database that contains records with incomplete data; some research participants did not complete all of the available questions on my survey. How should I handle this problem?
Missing or incomplete data are a serious problem in many fields of research. An added complication is that the more data that are missing in a database, the more likely it is that you will need to address the problem of incomplete cases, yet those are precisely the situations where imputing or filling in values for the missing data points is most questionable due to the small proportion of valid data points relative to the size of the data matrix. This FAQ highlights commonly-used methods of handling incomplete data problems. It discusses a number of their known strengths and weaknesses. At the end of the FAQ a software table is provided that compares and contrasts some commonly-used software options for handling missing data and details their availability to University of Texas faculty, students, and staff.
When you choose a missing data handling approach, keep in mind that one of the desired outcomes is maintaining (or approximating as closely as possible) the shape of the original distribution of responses. Some incomplete data handling methods do a better job of maintaining the distributional shape than others. For instance, one popular method of imputation, mean substitution, can result in a distribution with truncated variance.
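As a rough illustration of that variance shrinkage, the sketch below applies mean substitution to a small set of made-up survey responses and compares the sample variance before and after. The data and the helper name `mean_substitute` are assumptions for illustration only; they do not come from any particular package.

```python
from statistics import mean, variance

def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical responses on a 1-5 scale; None marks a skipped question.
responses = [1, 2, None, 4, 5, None, 3, None, 5, 1]

observed = [v for v in responses if v is not None]
imputed = mean_substitute(responses)

var_observed = variance(observed)   # sample variance of the observed answers
var_imputed = variance(imputed)     # sample variance after mean substitution

print(f"observed: n={len(observed)}, variance={var_observed:.3f}")
print(f"imputed:  n={len(imputed)}, variance={var_imputed:.3f}")
```

Every imputed value sits exactly at the mean, so the filled-in variable shows less spread than the responses that were actually observed.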
If you have questions about the advisability of applying a particular method to your own database, we recommend you schedule an appointment with a Statistical Services consultant to discuss these issues as they pertain to your own unique circumstances (note: This service is available to University of Texas faculty, staff, and students only). Missing data imputation and handling is a rapidly evolving field with many methods, each applicable in some circumstances but not others.
Types of missing data
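The most appropriate way to handle missing or incomplete data depends upon how the data points became missing. Little and Rubin (1987) define three types of missing data mechanisms: data missing completely at random (MCAR), where the probability that a value is missing is unrelated to any observed or unobserved values; data missing at random (MAR), where that probability may depend on observed values but not on the missing values themselves; and non-ignorable missing data, where the probability depends on the unobserved values.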
In practice it is usually difficult to meet the MCAR assumption. MAR is an assumption that is more often, but not always, tenable. The more relevant and related predictors one can include in statistical models, the more likely it is that the MAR assumption will be met.
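The practical difference between the MCAR and MAR assumptions can be seen in a small simulation. The sketch below is a hypothetical setup, not taken from any study: it generates a fully observed predictor `x` and a correlated outcome `y`, then deletes `y` values either with the same probability for every case (MCAR) or with a probability that depends only on the observed `x` (MAR). The sample size, missingness rates, and variable names are all assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulate a fully observed predictor x and a correlated outcome y.
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]   # true mean of y is 0

# MCAR: every y value has the same 40% chance of being missing.
y_mcar = [yi for yi in y if random.random() > 0.4]

# MAR: y is missing far more often when the observed x is above zero,
# so missingness depends only on a variable we actually have.
y_mar = [yi for xi, yi in zip(x, y)
         if random.random() > (0.7 if xi > 0 else 0.1)]

print(f"full-data mean of y:        {mean(y):+.3f}")
print(f"complete-case mean (MCAR):  {mean(y_mcar):+.3f}")
print(f"complete-case mean (MAR):   {mean(y_mar):+.3f}")
```

In a run like this the complete-case mean should stay close to the full-data mean under MCAR and drift away from it under MAR. That drift is the reason methods that use the observed predictors are preferred when only MAR is plausible.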
Methods of handling missing data
Some of the more popular methods for handling missing data appear below. This list is not exhaustive, but it covers some of the more widely recognized approaches to handling databases with incomplete cases.
Roth (1994) reviews these methods and concludes, as did Little & Rubin (1987) and Wothke (1998), that listwise, pairwise, and mean substitution missing data handling methods are inferior when compared with maximum likelihood based methods such as raw maximum likelihood or multiple imputation. Regression methods are somewhat better, but not as good as hot deck imputation or maximum likelihood approaches. The EM method falls somewhere in between: It is generally superior to listwise, pairwise, and mean substitution approaches, but it lacks the uncertainty component contained in the raw maximum likelihood and multiple imputation methods.
It is important to understand that these missing data handling methods and the discussion that follows deal with incomplete data primarily from the perspective of estimation of parameters and computation of test statistics rather than prediction of values for specific cases. Warren Sarle at SAS Institute has put together a helpful paper on the topic of missing data in the contexts of prediction and data mining. The paper can be found online in postscript form at ftp://ftp.sas.com/pub/neural/JCIS98.ps.
Hot deck and maximum likelihood-based approaches to handling missing data
Hot deck
Hot deck imputation fills in missing cells in a data matrix with the next most similar case's values. Consider the following example database.
Case | Item 1 | Item 2 | Item 3 | Item 4
1 | 4 | 1 | 2 | 3
2 | 5 | 4 | 2 | 5
3 | 3 | 4 | 2 | (missing)
Case three has a missing data cell for item four. Hot deck imputation examines the cases with complete records (cases one and two in this example) and substitutes the value of the most similar case for the missing data point. Case two and case three have the same values for items two and three, whereas case one and case three have the same value for item three only. Therefore, case two is more similar to case three, the case with the missing data point, than is case one. Note: There are different strategies for how to judge similarity.
Once the hot deck imputation determines which case among the observations with complete data is the most similar to the record with incomplete data, it substitutes the most similar complete case's value for the missing variable into the data matrix.
Case | Item 1 | Item 2 | Item 3 | Item 4
1 | 4 | 1 | 2 | 3
2 | 5 | 4 | 2 | 5
3 | 3 | 4 | 2 | 5
Since case two had the value of five for item four, the hot deck procedure imputes a value of five for case three to replace the missing data cell. Data analysis may then proceed using the new complete database.
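The worked example can be reproduced with a few lines of code. The sketch below is a minimal illustration, not a production routine: it assumes that "similarity" means the number of items on which a complete donor case and the incomplete case share the same value, which is only one of many possible rules. The function name `hot_deck_impute` is hypothetical.

```python
def hot_deck_impute(records):
    """Fill each missing cell (None) from the most similar complete case.

    Similarity here is simply the number of items on which the donor and
    the incomplete case share the same observed value; other similarity
    rules are possible.
    """
    donors = [r for r in records if None not in r]
    completed = []
    for record in records:
        if None not in record:
            completed.append(list(record))
            continue

        # Count matching values on the items the incomplete case did answer.
        def matches(donor):
            return sum(1 for d, v in zip(donor, record)
                       if v is not None and d == v)

        best = max(donors, key=matches)   # ties go to the earliest donor
        completed.append([b if v is None else v
                          for b, v in zip(best, record)])
    return completed

# Items 1-4 for cases one, two, and three from the example table.
data = [
    [4, 1, 2, 3],
    [5, 4, 2, 5],
    [3, 4, 2, None],
]

imputed = hot_deck_impute(data)
print(imputed[2])   # case three after imputation
```

Under this similarity rule case two is chosen as the donor, so case three receives a five for item four, matching the table above. A more careful implementation would need a rule for ties and for datasets with no complete cases.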
Hot deck imputation has a long history of use, including years of use by the United States Census Bureau. It can be superior to listwise deletion, pairwise deletion, and mean substitution approaches to handling missing data. Among hot deck's advantages are its conceptual simplicity, its maintenance of the proper measurement level of variables (categorical variables remain categorical and continuous variables remain continuous), and the availability of a complete data matrix at the end of the imputation process that can be analyzed like any complete data matrix. One of hot deck's disadvantages is the difficulty in defining "similarity"; there may be any number of ways to define what similarity is in this context. Thus, the hot deck procedure is not an "out of the box" approach to handling incomplete data. Instead it requires that you develop custom software syntax to perform the selection of donor cases and the subsequent imputation of missing values in your database. More sophisticated hot deck algorithms would identify more than one similar record and then randomly select one of those available donor records to impute the missing value or use an average value if that were appropriate.
Two examples of SAS macros used to perform hot deck imputation can be found online. John Stiller and Donald R. Dalzell (1998) wrote a paper titled "Hot-deck Imputation with SAS® Arrays and Macros for Large Surveys" which can be found at http://www2.sas.com/proceedings/sugi23/Stats/p246.pdf. Lawrence Altmayer from the U.S. Bureau of the Census wrote a paper "Hot-Deck Imputation: A Simple DATA Step Approach" which can be found at http://www8.sas.com/scholars/05/PREVIOUS/1999/pdf/075.pdf.
Expectation maximization (EM)
The expectation maximization (EM) approach to missing data handling is documented extensively in Little & Rubin (1987). The EM approach is an iterative procedure that proceeds in two discrete steps. First, in the expectation (E) step the procedure computes the expected value of the complete data log likelihood based upon the complete data cases and the algorithm's "best guess" as to what the sufficient statistical functions are for the missing data based upon the model specified and the existing data points; actual imputed values for the missing data points need not be generated. In the maximization (M) step it substitutes the expected values (typically means and covariances) for the missing data obtained from the E step and then maximizes the likelihood function as if no data were missing to obtain new parameter estimates. The new parameter estimates are substituted back into the E step and a new M step is performed. The procedure iterates through these two steps until convergence is obtained. Convergence occurs when the change of the parameter estimates from iteration to iteration becomes negligible.
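To make the two steps concrete, the sketch below runs EM on a deliberately simple special case: two variables, where `x` is fully observed and some `y` values are missing. It assumes the pair is bivariate normal and that `y` is missing at random given `x`. The function name `em_bivariate`, the data, and the convergence tolerance are illustrative assumptions; general-purpose software handles arbitrary missing data patterns.

```python
def em_bivariate(x, y, tol=1e-8, max_iter=1000):
    """EM estimates of means and (co)variances when some y values are None.

    Assumes x is fully observed, (x, y) is bivariate normal, and y is
    missing at random. Variances use the maximum likelihood divisor n.
    """
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n

    # Starting values from the complete cases only.
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    m = len(obs)
    mu_y = sum(yi for _, yi in obs) / m
    s_yy = sum((yi - mu_y) ** 2 for _, yi in obs) / m
    s_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in obs) / m

    for iteration in range(1, max_iter + 1):
        # E step: expected sufficient statistics given current estimates.
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + slope * (xi - mu_x)   # E[y | x]
                eyy = ey ** 2 + resid_var         # E[y^2 | x]
            else:
                ey, eyy = yi, yi ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M step: update the estimates as if no data were missing.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy),
                     abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:   # negligible change between iterations
            break
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx,
            "s_yy": s_yy, "s_xy": s_xy, "iterations": iteration}

# Hypothetical data: y is unobserved for the three largest x values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.2, 3.8, 5.1, None, None, None]
est = em_bivariate(x, y)
print({k: round(v, 4) for k, v in est.items()})
```

Note that the E step never writes imputed values into the data. It only accumulates the expected sums that the M step needs, and the added residual variance term is what keeps the estimated variance of `y` from being understated.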
The SPSS Missing Values Analysis (MVA) module employs the EM approach to missing data handling. The strength of the approach is that it has well-known statistical properties and it generally outperforms popular ad hoc methods of incomplete data handling such as listwise and pairwise data deletion and mean substitution because it assumes incomplete cases have data missing at random (MAR) rather than missing completely at random (MCAR). The primary disadvantage of the EM approach is that it adds no uncertainty component to the estimated data. Practically speaking, this means that while parameter estimates based upon the EM approach are reliable, standard errors and associated test statistics (e.g., t-tests) are not. This shortcoming led statisticians to develop two newer likelihood-based methods for handling missing data, the raw maximum likelihood approach and multiple imputation.
Raw maximum likelihood
Raw maximum likelihood, also known as Full Information Maximum Likelihood (FIML), methods use all available data points in a database to construct the best possible first and second order moment estimates under the MAR assumption. Put less technically, if the missing at random (MAR) assumption can be met, maximum likelihood-based methods can generate a vector of means and a covariance matrix among the variables in a database that is superior to the vector of means and covariance matrix produced by commonly-used missing data handling methods such as listwise deletion, pairwise deletion, and mean substitution. See Wothke (1998) for a convincing demonstration.
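The phrase "use all available data points" can be illustrated with the quantity that such methods maximize. The sketch below computes an observed-data log likelihood for just two jointly normal variables: a complete case contributes its bivariate normal density, and a case with one value missing contributes the univariate density of the value it does have. The function names and the two-variable restriction are assumptions made to keep the example short; actual software generalizes this to many variables and to user-specified models.

```python
import math

def normal_logpdf(v, mu, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mu) ** 2 / var)

def fiml_loglik(cases, mu, cov):
    """Observed-data log likelihood for two jointly normal variables.

    cases: list of (x, y) pairs in which either value may be None.
    mu:    (mu_x, mu_y)
    cov:   ((s_xx, s_xy), (s_xy, s_yy))
    Each case contributes the density of whatever it actually observed.
    """
    (s_xx, s_xy), (_, s_yy) = cov
    det = s_xx * s_yy - s_xy ** 2
    total = 0.0
    for x, y in cases:
        if x is not None and y is not None:
            dx, dy = x - mu[0], y - mu[1]
            quad = (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det
            total += -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
        elif x is not None:
            total += normal_logpdf(x, mu[0], s_xx)
        elif y is not None:
            total += normal_logpdf(y, mu[1], s_yy)
        # A case with nothing observed contributes nothing.
    return total

# Hypothetical cases: two complete, one missing y, one missing x.
cases = [(1.0, 1.5), (2.0, 2.5), (3.0, None), (None, 0.5)]
loglik = fiml_loglik(cases, mu=(2.0, 1.5), cov=((1.0, 0.5), (0.5, 1.0)))
print(round(loglik, 4))
```

An optimizer would search for the means and covariances that make this value as large as possible. No case is discarded, and no value is filled in.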
Under an unrestricted mean and covariance structure, raw maximum likelihood and EM return identical parameter estimate values. Unlike EM, however, raw maximum likelihood can be employed in the context of fitting user-specified linear models, such as structural equation models, regression models, ANOVA and ANCOVA models, etc. Raw maximum likelihood also produces standard errors and parameter estimates under the assumption that the fitted model is not false, so parameter estimates and standard errors are model-dependent. That is, their values will depend upon the model chosen and fitted by the investigator.
Raw maximum likelihood missing data handling is implemented in the AMOS structural equation modeling package currently supported by ITS. The primary advantage of this method from a practical standpoint is that it is built into the software package: the AMOS user simply clicks on a check box to enable missing data handling. The program then fits the analyst's model using the raw maximum likelihood missing data handling approach. Any general linear model including ANOVA, ANCOVA, MANOVA, MANCOVA, path analysis, confirmatory factor analysis, and numerous time series and longitudinal models can be fit using AMOS.
Other software packages that use the raw maximum likelihood approach to handle incomplete data are the MIXED procedure in SAS and SPSS (see the paper titled "Linear mixed-effects modeling in SPSS") and Michael Neale's MX. The MIXED procedure can fit ANOVA, ANCOVA, and repeated measures models with time-constant and time-varying covariates. You should strongly consider using a MIXED procedure instead of SAS PROC GLM or the SPSS General Linear Models (GLM) procedures whenever you have repeated measures data with missing data points. The MIXED procedures can also fit hierarchical linear models (HLMs), also known as multilevel or random coefficient models. MX is a freeware structural equation modeling program.
Raw maximum likelihood has the advantage of convenience/ease of use and well-known statistical properties. Unlike EM, it also allows for the direct computation of appropriate standard errors and test statistics. Disadvantages include an assumption of joint multivariate normality of the variables used in the analysis and the lack of a raw data matrix produced by the analysis. Recall that the raw maximum likelihood method only produces a covariance matrix and a vector of means for the variables; the statistical software then uses these as inputs for further analyses.
Raw maximum likelihood methods are also model-based. That is, they are implemented as part of a fitted statistical model. The investigator may want to include relevant variables (e.g., reading comprehension) that will improve the accuracy of parameter estimates, but not include these variables in the statistical model as predictors or outcomes. While it is possible to do this, it is not always easy or convenient, particularly in large or complex models.
Finally, raw maximum likelihood assumes the incomplete data cells are missing at random. Wothke (1998) suggests, however, that raw maximum likelihood can offer superior performance to listwise and pairwise deletion methods even in the nonignorable data situation.
Multiple imputation
Multiple imputation combines the well-known statistical advantages of EM and raw maximum likelihood with the ability of hot deck imputation to provide a raw data matrix to analyze. Multiple imputation works by generating a maximum likelihood-based covariance matrix and vector of means, like EM. Multiple imputation takes the process one step further by introducing statistical uncertainty into the model and using that uncertainty to emulate the natural variability among cases one encounters in a complete database. Multiple imputation then imputes actual data values to fill in the incomplete data points in the data matrix, just as hot deck imputation does.
The primary difference between multiple imputation and hot deck imputation from a practical or procedural standpoint is that multiple imputation requires that the data analyst generate five to ten databases with imputed values. The data analyst then analyzes each database, collects the results from the analyses, and summarizes them into one summary set of findings. For instance, suppose a researcher wishes to perform a multiple regression analysis on a database with incomplete data. The researcher would run multiple imputation, generate ten imputed databases, and run the multiple regression analysis on each of the ten databases. The researcher then combines the results from the ten regression analyses together into one summary for presentation, not necessarily a trivial task.
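The combining step is often done with the pooling formulas usually credited to Rubin. The FAQ does not say which formulas it has in mind, so the sketch below is an assumption on that point. It pools one regression coefficient across imputed databases: the pooled estimate is the average of the per-database estimates, and the pooled variance adds the average within-database variance to an inflated between-database variance. The numbers are made up, and degrees of freedom for the resulting test statistic are omitted.

```python
from statistics import mean, variance

def pool_estimates(estimates, standard_errors):
    """Pool one parameter across m imputed databases.

    Returns the pooled estimate and its pooled standard error, where
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])   # average squared SE
    between = variance(estimates)                        # spread across databases
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical slope estimates and standard errors from five imputed databases.
slopes = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_slope, pooled_se = pool_estimates(slopes, std_errors)
print(f"pooled slope = {pooled_slope:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than any single database's standard error because it also reflects how much the estimate moves from one imputed database to the next.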
Multiple imputation has several advantages: It is fairly well understood and is robust to non-normality of the variables used in the analysis. Like hot deck imputation, it outputs complete raw data matrices. It is clearly superior to listwise, pairwise, and mean substitution methods of handling missing data in most cases. Disadvantages include the time intensiveness in imputing five to ten databases, testing models for each database separately, and recombining the model results into one summary. Furthermore, summary methods have been worked out for linear and logistic regression models, but work is still in progress to provide statistically appropriate summarization methods for other models such as factor analysis, structural equation models, multinomial logit regression models, etc.
Schafer (1997) thoroughly documents multiple imputation theory in a textbook. Schafer has also written the freeware PC program NORM to perform multiple imputation analysis. SAS users can review a set of SAS macro programs called SIRNORM that perform multiple imputation. Another freeware program similar to NORM called Amelia may also be downloaded.
Pattern-mixture models for non-ignorable missing data
All the methods of missing data handling considered above require that the data meet the Little & Rubin (1987) missing at random (MAR) assumption. There are circumstances, however, when this assumption cannot be met to a satisfactory degree; cases are considered missing due to non-ignorable causes (Heitjan, 1997). In such instances the investigator may want to consider the use of a pattern-mixture model, a term used by Hedeker & Gibbons (1997). Earlier works dealing with pattern-mixture models include Little & Schenker (1995), Little (1993), and Glynn, Laird, & Rubin (1986).
Pattern-mixture models categorize the different patterns of missing values in a dataset into a predictor variable, and this predictor variable is incorporated into the statistical model of interest. The investigator can then determine if the missing data pattern has any predictive power in the model, either by itself (a main effect) or in conjunction with another predictor (an interaction effect).
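The first step, turning missing data patterns into a predictor variable, can be sketched as follows. This is a minimal illustration with made-up three-wave data and hypothetical function names. It only builds the pattern codes; fitting the model and computing pattern-mixture averaged results are separate tasks.

```python
def missing_pattern(record):
    """Return a string such as 'OOM' marking observed (O) and missing (M) items."""
    return "".join("M" if v is None else "O" for v in record)

def add_pattern_predictor(records):
    """Attach an integer pattern code to each record.

    Codes are assigned in order of first appearance, so records that share
    the same pattern of missing values receive the same code.
    """
    codes = {}
    coded = []
    for record in records:
        pattern = missing_pattern(record)
        code = codes.setdefault(pattern, len(codes))
        coded.append((record, pattern, code))
    return coded, codes

# Hypothetical scores at three measurement occasions; None marks dropout.
records = [
    [10, 12, 13],
    [9, 11, None],
    [11, None, None],
    [8, 10, 12],
    [12, 13, None],
]

coded, codes = add_pattern_predictor(records)
for record, pattern, code in coded:
    print(record, pattern, code)
```

The resulting code would then enter the statistical model as a categorical predictor, on its own and in interaction with other predictors. With many distinct patterns and few cases per pattern, some patterns may need to be combined before the model can be estimated.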
The chief advantage of the pattern-mixture model is that it does not assume the incomplete data are missing at random (MAR) or missing completely at random (MCAR). The primary disadvantage of the pattern-mixture model approach is that it requires some custom programming on the part of the data analyst to obtain one part of the pattern-mixture analysis, the pattern-mixture averaged results. It is worth noting, however, that Hedeker & Gibbons (1997, Appendix) demonstrate that some results may be obtained by using the SAS MIXED procedure and they provide sample SAS/IML code to obtain pattern-mixture averaged results on their Web site. If the number of missing data patterns and the number of variables with missing data are large relative to the number of cases in the analysis, the model may not converge due to insufficient data to support the use of many main effect and interaction terms.
Conclusions
Although applied researchers cannot turn to a single "one size fits all" solution for handling incomplete data problems, several trends in the missing data analysis literature are worth noting. First, ad hoc and commonly-used methods of handling incomplete data such as listwise and pairwise deletion and mean substitution are inferior to hot deck imputation, raw maximum likelihood, and multiple imputation methods in most situations. Second, software to perform hot deck, raw maximum likelihood, and multiple imputation is becoming more widely available and easier to use.
Although all of the methods described so far assume the incomplete data are missing at random, new statistical models are being developed to handle data missing due to nonignorable factors. Some of these models can be partially fit using familiar statistical packages and procedures such as the MIXED procedure in either SAS (e.g., Hedeker & Gibbons, 1997) or SPSS (see the paper titled "Linear mixed-effects modeling in SPSS").
References
Glynn, R., Laird, N.M., & Rubin, D.B. (1986). Selection modeling versus mixture modeling with nonignorable nonresponse. In H. Wainer (ed.) Drawing Inferences from Self-Selected Samples, 119-146. New York: Springer-Verlag.
Graham, J.W., Hofer, S.M., & MacKinnon, D.P. (1996). Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research, 31(2), 197-218.
Hedeker, D. & Gibbons, R.D. (1997). Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2(1), 64-78.
Heitjan, D.F. (1997). Annotation: What can be done about missing data? Approaches to imputation. American Journal of Public Health, 87(4), 548-550.
Iannacchione, V.G. (1982). Weighted sequential hot deck imputation macros. Proceedings of the SAS Users Group International Conference, 7, 759-763.
Little, R.J.A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88, 125-134.
Little, R.J.A., & Schenker, N. (1995). Missing Data. In Arminger, Clogg, & Sobel (eds.) Handbook of Statistical Modeling for the Social and Behavioral Sciences. New York: Plenum.
Little, R.J.A. & Rubin, D.B. (1987). Statistical analysis with missing data. New York: John Wiley and Sons.
Roth, P. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537-560.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Book number 72 in the Chapman & Hall series Monographs on Statistics and Applied Probability. London: Chapman & Hall.
Wothke, W. (1998). Longitudinal and multi-group modeling with missing data. In T.D. Little, K.U. Schnabel, & J. Baumert (Eds.) Modeling longitudinal and multiple group data: Practical issues, applied approaches and specific examples. Mahwah, NJ: Lawrence Erlbaum Associates.
Software Table
The table below specifies several commonly-used software options for handling missing or incomplete data. The table is not intended to be an exhaustive list of every possible missing data-handling software package. However, if you discover or know of another software option you have used successfully, please let us know by sending e-mail to us at the address listed at the bottom of this page.
The table lists the name of the software, the method of handling incomplete data, assumptions it makes about the causes of missing data, whether the package is supported at UT Austin, pricing and availability to UT faculty, students, and staff, and miscellaneous comments generally dealing with the perceived ease of use of the package from the perspective of computing novices. Note that in addition to the assumptions about the origins of incomplete data, many of the methods shown below also contain other tacit assumptions (e.g., joint multivariate normality of variables included in the analysis).
Name | Method | Assumptions | Supported? | Pricing and Availability | Comments
… | … | … at random (MAR) | … | Free to download from the … | …
… (e.g., PROC STANDARD) | … substitution | … | … | … | …
… Multiple Imputation Programs | … Imputation | … at random (MAR) | … | … | … novices to use.
… | … Imputation | … | … | … | … novices to use.
… | … with bootstrapping option for covariance matrices | … at random (MAR) | … | … | … novices to use
… | … substitution | … | … | … | …
… Missing Values Analysis (MVA) add-in module | … | … at random (MAR) | … | … | …
… | … maximum likelihood | … | … | … | …
… | … maximum likelihood | … | … | … MX Web site. | …
… | … Imputation | … at random (MAR) | … | … NORM Web site. | …
… | … Regression | … | … | … for direct ordering information. | … interface appears fairly easy to use.
… | … mixture model approach | … | … | … | … in its entirety.
… SAS macro program set | … Imputation | … at random (MAR) | … | … | … use for novices.
If you have further questions, send E-mail to stats@ssc.utexas.edu.