Methodological Progress Note: Classification and Regression Tree Analysis

Journal of Hospital Medicine 15(9). 2020 September;549-551. Published Online First March 18, 2020 | 10.12788/jhm.3366

© 2020 Society of Hospital Medicine

Classification Versus Regression Trees

While commonly grouped together, CARTs can be distinguished from one another based on the dependent, or outcome, variable. Categorical outcome variables require the use of a classification tree, whereas continuous outcomes require regression trees. Of note, the independent, or predictor, variables can be any combination of categorical or continuous variables. However, regardless of predictor type, the split at each node of a CART produces a categorical output: every observation is assigned to one of the resulting child nodes.
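As a minimal sketch of this distinction, the example below assumes Python with scikit-learn and entirely synthetic data (the variables, cutoffs, and outcomes are hypothetical, not drawn from any study): a classification tree is fit to a binary outcome and a regression tree to a continuous outcome, using the same mix of continuous and categorical predictors.

```python
# Minimal sketch: classification tree for a categorical outcome,
# regression tree for a continuous outcome (synthetic, hypothetical data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)

# Predictors may mix continuous (eg, age) and categorical (eg, sex, coded 0/1) variables.
X = np.column_stack([rng.uniform(18, 90, 500),   # age in years
                     rng.integers(0, 2, 500)])   # sex, coded 0/1

# Categorical outcome (eg, readmission yes/no) -> classification tree.
y_binary = (X[:, 0] > 65).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_binary)

# Continuous outcome (eg, length of stay in days) -> regression tree.
y_continuous = 2 + 0.05 * X[:, 0] + rng.normal(0, 1, 500)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_continuous)

print(clf.predict([[72, 1]]), reg.predict([[72, 1]]))
```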

Splitting Criteria

The splitting of each node is based on reducing the degree of “impurity” (heterogeneity with respect to the outcome variable) within each node. For example, a node that has no impurity will have an error rate of zero when labeling its binary outcomes. While CART works well with categorical variables, continuous variables (eg, age) can also be assessed, though only with certain algorithms. Several different splitting criteria exist, each of which attempts to maximize the homogeneity within each child node (and thereby the difference between the child nodes). While beyond the scope of this review, examples of popular splitting criteria are Gini, entropy, and minimum error.5
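To make the notion of impurity concrete, the sketch below (a hand calculation assuming only Python and NumPy; the node contents are hypothetical) computes the Gini index and entropy for a pure node and for a maximally mixed binary node. Both measures equal zero when a node is pure and grow as the node becomes more heterogeneous.

```python
# Hand calculation of two common impurity measures for hypothetical nodes.
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: 0 for a pure node, larger for more mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure_node = [1, 1, 1, 1]    # every observation shares the same outcome
mixed_node = [1, 1, 0, 0]   # maximally impure binary node

print(gini(pure_node), entropy(pure_node))     # 0.0 0.0
print(gini(mixed_node), entropy(mixed_node))   # 0.5 1.0
```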

Stopping Rules

To manage the size of a tree, CART analysis allows for predefined stopping rules that limit the extent of growth while also establishing the minimal degree of statistical difference between nodes that is considered meaningful. To accomplish this task, two stopping rules are often used. The first defines the minimum number of observations in child, or “terminal,” nodes. The second defines the maximum number of levels to which a tree may grow, thus allowing the investigator to cap the number of predictor variables that can define a terminal node. While several other stopping rules exist, these are the most commonly utilized.
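In software implementations of CART these two rules typically appear as hyperparameters. The sketch below assumes Python with scikit-learn and synthetic data; the argument names are specific to that package and differ elsewhere.

```python
# Two common stopping rules expressed as hyperparameters
# (scikit-learn argument names; other packages differ).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

tree = DecisionTreeClassifier(
    min_samples_leaf=20,  # rule 1: each terminal node must hold at least 20 observations
    max_depth=3,          # rule 2: at most 3 levels, so at most 3 splits define any terminal node
    random_state=0,
).fit(X, y)

print("depth:", tree.get_depth(), "terminal nodes:", tree.get_n_leaves())
```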

Pruning

To avoid missing important associations due to premature stoppage, investigators may use another mechanism to limit tree growth called “pruning.” For pruning, the first step is to grow a deliberately large tree that includes many levels or nodes, possibly to the point where there are only a few observations per terminal node. Then, analogous to the residual sum of squares in a regression, the investigator can calculate a misclassification cost (ie, a measure of goodness of fit) and select the tree with the smallest cost.2 Of note, stopping rules and pruning can be used simultaneously.
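One common way to carry out this two-step procedure is cost-complexity pruning. The sketch below assumes scikit-learn's implementation and synthetic data, and uses cross-validated accuracy as a stand-in for the misclassification cost; it may differ in detail from the procedure described in the cited reference.

```python
# Cost-complexity pruning sketch (scikit-learn; other implementations differ).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Step 1: grow a deliberately large tree (few observations per terminal node).
full_tree = DecisionTreeClassifier(min_samples_leaf=1, random_state=0).fit(X, y)

# Step 2: compute the sequence of candidate subtrees and their complexity penalties.
path = full_tree.cost_complexity_pruning_path(X, y)

# Step 3: keep the penalty whose pruned tree has the best cross-validated accuracy
# (one way to approximate "the tree with the smallest misclassification cost").
scores = [(alpha,
           cross_val_score(DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
                           X, y).mean())
          for alpha in path.ccp_alphas]
best_alpha = max(scores, key=lambda s: s[1])[0]

pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("terminal nodes before:", full_tree.get_n_leaves(),
      "after pruning:", pruned_tree.get_n_leaves())
```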

Classification Error

As with other forms of statistical inference, it remains important to understand the uncertainty surrounding the results. In regression modeling, for example, this uncertainty can be quantified using the standard errors of the parameter estimates. In CART analysis, because random samples from a population may produce different trees, measures of variability can be more complicated. One strategy is to generate a tree from one portion of the data (a learning sample) and then use the remaining data (a test sample) to calculate a measure of the misclassification cost (a measure of how much additional accuracy a split must add to the entire tree to warrant the additional complexity). Alternatively, a “k-fold cross-validation” can be performed in which the data are broken down into k subsets, a tree is created using all data except one of the subsets, and the computed tree is then applied to the held-out subset to determine a misclassification cost; the process is repeated so that each subset serves once as the held-out sample. These misclassification costs are important as they also impact the stopping and pruning processes. Ultimately, a final tree, which best limits classification errors, is selected.
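As a concrete illustration of these two strategies, the sketch below (again assuming Python with scikit-learn and synthetic data) estimates the misclassification rate first with a held-out test sample and then with 10-fold cross-validation.

```python
# Two ways to estimate classification error (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Strategy 1: grow the tree on a learning sample, estimate error on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("held-out misclassification rate:", 1 - tree.score(X_test, y_test))

# Strategy 2: k-fold cross-validation; each fold in turn serves as the held-out subset.
cv_accuracy = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                              X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold misclassification rate:", 1 - cv_accuracy.mean())
```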