
Methodological Progress Note: Handling Missing Data in Clinical Research

Journal of Hospital Medicine 15(4). 2020 April;237-239. Published Online First November 20, 2019 | 10.12788/jhm.3330

© 2019 Society of Hospital Medicine

UNDERSTANDING THE REASONS FOR MISSING DATA

Different data sources are likely to have unique reasons for missing values because of the workflows through which the data are collected. In research that uses data from electronic medical records, missing data on specific diagnoses for patients who are regularly engaged in care are often considered to be “not present” or “normal.” This is because clinical documentation workflows are largely governed by the concept of “documentation by exception,” in which diagnoses are documented only when there is an exception to the expectation that they are not present. For example, “diabetes mellitus” is commonly documented, but “diabetes mellitus not present” is rarely documented in electronic medical records that are used for clinical care. Thus, lack of explicit documentation is likely to indicate that diabetes mellitus is, in fact, not present.
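As a minimal sketch of how “documentation by exception” might be applied when building an analytic variable, the Python example below codes a diagnosis as absent (0) when it is not documented for a patient who is regularly engaged in care. The function name, the patient identifiers, and the data layout are hypothetical and chosen only for illustration; whether this coding is appropriate depends on the data source and on how engagement in care is defined.

```python
from typing import Dict, List, Set


def diagnosis_indicators(
    documented: Dict[str, Set[str]],
    engaged_patients: List[str],
    diagnosis: str,
) -> Dict[str, int]:
    """Code a diagnosis as 1 if documented and 0 otherwise.

    Assumption (hypothetical): every patient in `engaged_patients` is
    regularly engaged in care, so absence of documentation is read as
    "not present" rather than as a missing value.
    """
    return {
        pid: int(diagnosis in documented.get(pid, set()))
        for pid in engaged_patients
    }


# Hypothetical problem lists keyed by patient identifier.
records = {
    "p1": {"diabetes mellitus", "hypertension"},
    "p2": {"hypertension"},
}
engaged = ["p1", "p2", "p3"]  # p3 has no documented diagnoses at all

flags = diagnosis_indicators(records, engaged, "diabetes mellitus")
print(flags)  # {'p1': 1, 'p2': 0, 'p3': 0}
```

The same rule should not be applied automatically to patients who are not regularly engaged in care, for whom an undocumented diagnosis may reflect a lack of opportunity to document it.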

Certain variables may be missing simply because there is no quantifiable value (ie, the data do not exist). Structural missingness refers to a value that does not exist for a logical reason (eg, “What is the gender of your first child?” for those who do not have a child). Censoring, which occurs in “time to event” analysis, refers to a situation in which information about a subject stops before the event of interest happens, for example, when a subject in a study involving a 30-day outcome dies at day 14. The term “limit of detection” refers to the lowest or highest level at which two distinct values can reasonably be distinguished (eg, the lower limit of detection of a C-reactive protein assay may be 1 mg/dL, so lower values might simply be reported by the lab as <1 mg/dL).5 These types of missing data require specific methods that are not discussed in this review.
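Before any analysis, it can help to keep values reported at a limit of detection distinct from values that are truly absent. The Python sketch below illustrates one way to do so, assuming that the laboratory reports such results as text prefixed with “<” or “>”; the function name and the status labels are hypothetical. It only flags the values and does not implement any of the specific methods needed to analyze them.

```python
from typing import Optional, Tuple


def parse_lab_result(raw: Optional[str]) -> Tuple[Optional[float], str]:
    """Split a reported lab result into a numeric value and a status label.

    Assumed (hypothetical) conventions:
      "<1"   -> (1.0, "below_lod")  value is below the lower limit of detection
      ">500" -> (500.0, "above_lod") value is above the upper limit of detection
      "3.2"  -> (3.2, "observed")
      "" or None -> (None, "missing")
    """
    if raw is None or raw.strip() == "":
        return None, "missing"
    text = raw.strip()
    if text.startswith("<"):
        return float(text[1:]), "below_lod"
    if text.startswith(">"):
        return float(text[1:]), "above_lod"
    return float(text), "observed"


# Hypothetical C-reactive protein results in mg/dL.
reported = ["3.2", "<1", "", None]
parsed = [parse_lab_result(r) for r in reported]
for raw, result in zip(reported, parsed):
    print(repr(raw), "->", result)
```

Keeping the status label alongside the numeric limit preserves the information that the true value lies below (or above) the limit, which would be lost if the result were simply recorded as missing.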

These examples illustrate that approaches to dealing with missing data vary depending on what data sources are used and how data are collected. Understanding the reasons missing data are present is a necessary step in formulating a robust analytic approach to handling missing data.

MISSING DATA PATTERNS AND MECHANISMS

Missing Data Patterns

Evaluating missing data patterns provides information on the degree and complexity of the missing data problem and can aid in choosing an appropriate missing data handling method. This is because some analytic methods work well for a general (nonmonotone) pattern, whereas other methods work only for special patterns (eg, monotone, file matching). In longitudinal studies, data are commonly missing in a monotone pattern, in which once one variable is missing, all subsequent variables are also missing for that subject. This occurs when a study participant is lost to follow-up. For example, a monotone missing data pattern may occur in a study that requires a series of follow-up visits for laboratory blood tests. If a patient drops out, no blood test results are available from that point onward, which produces a monotone missing data pattern. If the patient instead skips an intermediate visit but returns for the final blood test, the result is a nonmonotone missing data pattern. A file-matching pattern occurs when variables are never observed together. This pattern can occur when data from several studies are merged and some variables are not collected in all studies. For example, three studies are merged and all three collect blood pressure, but only one study collects age and only one study collects sex.
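The patterns described above can be checked programmatically. The following Python sketch is illustrative only: it assumes each subject's measurements are stored as a list in visit order with None marking a missing value, and all function names and example values are hypothetical. A row is treated as monotone if no observed value follows a missing one, and two variables are flagged as a file-matching pair if they are never observed in the same row.

```python
from typing import Optional, Sequence

# One subject's values in visit (or column) order; None marks a missing value.
Row = Sequence[Optional[float]]


def is_monotone_row(row: Row) -> bool:
    """True if no observed value follows a missing one (dropout-style)."""
    seen_missing = False
    for value in row:
        if value is None:
            seen_missing = True
        elif seen_missing:
            return False  # the subject returned after a gap
    return True


def classify_pattern(rows: Sequence[Row]) -> str:
    """Label a data set as 'complete', 'monotone', or 'nonmonotone'."""
    if all(value is not None for row in rows for value in row):
        return "complete"
    if all(is_monotone_row(row) for row in rows):
        return "monotone"
    return "nonmonotone"


def never_observed_together(rows: Sequence[Row], i: int, j: int) -> bool:
    """True if columns i and j are never both observed in the same row."""
    return not any(
        row[i] is not None and row[j] is not None for row in rows
    )


# Hypothetical blood test results over three visits.
dropout = [[5.1, 4.8, 4.9], [5.4, None, None], [6.0, 5.7, None]]
skipped_visit = [[5.1, 4.8, 4.9], [5.4, None, 5.0]]

# Hypothetical merged studies; columns are blood pressure, age, sex
# (sex coded numerically for illustration).
merged = [[120.0, 54.0, None], [135.0, None, 1.0], [118.0, None, None]]

print(classify_pattern(dropout))               # monotone
print(classify_pattern(skipped_visit))         # nonmonotone
print(never_observed_together(merged, 1, 2))   # True
```

In the merged example, blood pressure is observed in every row, but age and sex never appear together, which is the file-matching situation.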