Evaluating automated rules for rapid response system alarm triggers in medical and surgical patients
BACKGROUND
The use of rapid response systems (RRS), which were designed to bring clinicians with critical care expertise to the bedside to prevent unnecessary deaths, has increased. RRS rely on accurate detection of acute deterioration events. Early warning scores (EWS) have been used for this purpose but were developed using heterogeneous populations. Predictive performance may differ in medical vs surgical patients.
OBJECTIVE
To evaluate the performance of published EWS in medical vs surgical patient populations.
DESIGN
Retrospective cohort study.
SETTING
Two tertiary care academic medical center hospitals in the Midwest totaling more than 1500 beds.
PATIENTS
All patients discharged from January to December 2011.
INTERVENTION
None.
MEASUREMENTS
Time-stamped longitudinal database of patient variables and outcomes, categorized as surgical or medical. Outcomes included unscheduled transfers to the intensive care unit, activation of the RRS, and calls for cardiorespiratory resuscitation (“resuscitation call”). The EWS were calculated and updated with every new patient variable entry over time. Scores were considered accurate if they predicted an outcome in the following 24 hours.
RESULTS
All EWS demonstrated higher performance within the medical population as compared to surgical: higher positive predictive value (P < .0001 for all scores) and sensitivity (P < .0001 for all scores). All EWS had positive predictive values below 25%.
CONCLUSIONS
The overall poor performance of the evaluated EWS was marginally better in medical patients when compared to surgical patients. Journal of Hospital Medicine 2017;12:217-223. © 2017 Society of Hospital Medicine
DISCUSSION
We hypothesized that EWS may perform differently when applied to medical rather than surgical patients. Studies had not analyzed this in a time-dependent manner,24-26 which limited the applicability of the results.8
All analyzed scores performed better in medical patients than in surgical patients (Figure). This could reflect a behavioral difference by the teams on surgical and medical floors in the decision to activate the RRS, or a bias of the clinicians who designed the scores (mostly nonsurgeons). The difference could also mean that physiological deteriorations are intrinsically different in patients who have undergone anesthesia and surgery. For example, in surgical patients, a bleeding episode is more likely to be the cause of their physiological deterioration, or the lingering effects of anesthesia could mask underlying deterioration. Such patients would benefit from scores where variables such as heart rate, blood pressure, or hemoglobin had more influence.
When comparing the different scores, it was much easier for a patient to meet the alerting score with the Worthing score than with GMEWS. In the Worthing score, a respiratory rate greater than 22 breaths per minute, or a systolic blood pressure less than 100 mm Hg, already meet alerting criteria. Similar vital signs result in 0 and 1 points (respectively) in GMEWS, far from its alerting score of 5. This reflects the intrinsic tradeoff of EWS: as the threshold for considering a patient “at risk” drops, not only does the number of true positives (and the sensitivity) increase, but also the number of false positives, thus lowering the PPV.
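The tradeoff described above can be sketched with a toy calculation. The scores and outcomes below are hypothetical values chosen only for illustration; they are not data from this study, and the function name `sensitivity_and_ppv` is an assumption of this sketch, not part of any published EWS.

```python
from typing import List, Tuple


def sensitivity_and_ppv(
    scores: List[int], deteriorated: List[bool], threshold: int
) -> Tuple[float, float]:
    """Return (sensitivity, PPV) when an alert fires at score >= threshold."""
    tp = fp = fn = 0
    for score, event in zip(scores, deteriorated):
        alert = score >= threshold
        if alert and event:
            tp += 1  # alert followed by a deterioration event
        elif alert and not event:
            fp += 1  # false alert
        elif event:
            fn += 1  # missed deterioration event
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv


# Hypothetical aggregate scores for 10 patients and whether each deteriorated.
scores = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6]
deteriorated = [False, False, False, False, True, False, True, False, True, True]

# Lowering the alerting threshold raises sensitivity but lowers PPV.
for threshold in (5, 3, 2):
    sens, ppv = sensitivity_and_ppv(scores, deteriorated, threshold)
    print(f"threshold {threshold}: sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

In this toy data set, dropping the threshold from 5 to 2 captures every deterioration event, but a growing share of the alerts are false positives.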
However, none of the scores analyzed were considered to perform well based on their PPV and sensitivity, particularly in the surgical subpopulation. Focusing on another metric, the area under the receiver operating characteristic curve, can give misleadingly optimistic results.24,27 The extremely low prevalence of acute physiological deterioration can produce low PPVs even when specificity seems acceptable, which is why it is important to evaluate PPV directly.28
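The dependence of PPV on prevalence follows from Bayes' theorem, and can be sketched as below. The sensitivity, specificity, and prevalence values are illustrative assumptions only, not estimates from this study, and the function name `ppv_from_prevalence` is hypothetical.

```python
def ppv_from_prevalence(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """Positive predictive value via Bayes' theorem.

    PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    """
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Assumed test characteristics (illustrative): 80% sensitivity, 90% specificity.
rare = ppv_from_prevalence(0.80, 0.90, 0.01)    # event in 1% of observations
common = ppv_from_prevalence(0.80, 0.90, 0.20)  # event in 20% of observations

print(f"PPV at 1% prevalence:  {rare:.3f}")
print(f"PPV at 20% prevalence: {common:.3f}")
```

With the same assumed sensitivity and specificity, the PPV falls from about two-thirds at 20% prevalence to under 10% at 1% prevalence, which is why a score can look acceptable on specificity alone and still generate mostly false alerts.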
To use EWS effectively to activate RRS, they need to be combined with clinical judgment to avoid high levels of false alerts, particularly in surgical patients. It has been reported that RRS is activated only 30% of the time a patient meets RRS calling criteria.29 While there may be cultural characteristics inhibiting the decision to call,30 our study hints at another explanation: if RRS were activated every time a patient met calling criteria based on the scores analyzed, the number of RRS calls would be very high and difficult to manage. Health providers may therefore be doing the right thing when “filtering” RRS calls, applying the criteria not strictly but in conjunction with clinical judgment.
A limitation of any study like this is how to define “acute physiological deterioration.” We defined an event as recognized episodes of acute physiological deterioration that are signaled by escalations of care (eg, RRS, resuscitation calls, or transfers to an ICU) or unexpected death. By definition, our calculated PPV is affected by clinicians’ recognition of clinical deteriorations. This definition, common in the literature, has the limitation of potentially underestimating EWS’ performance by missing some events that are resolved by the primary care team without an escalation of care. However, we believe our interpretation is not unreasonable since the purpose of EWS is to trigger escalations of care in a timely fashion. Prospective studies could define an event in a way that is less affected by the clinicians’ judgment.
Regarding patient demographics, age was similar between the 2 groups (average, 58.2 years for medical vs 58.9 years for surgical), and there was only a small difference in gender ratios (45.1% male in the medical vs 51.4% in the surgical group). These differences are unlikely to have affected the results significantly, but unknown differences in demographics or other patient characteristics between groups may account for differences in score performance between surgical and medical patients.
Several of the EWS analyzed had overlapping trigger criteria with our own RRS activation criteria (although as single-parameter triggers and not as aggregate). To test how these potential biases could affect our results, we performed a post hoc sensitivity analysis eliminating calls to the RRS as an outcome (so using the alternative outcome of unexpected transfers to the ICU and resuscitation calls). The results are similar to those of our main analysis, with all analyzed scores having lower sensitivity and PPV in surgical hospitalizations when compared to medical hospitalizations.
Our study suggests that, to optimize detection of physiological deterioration events, EWS should take into account different patient types, with the most basic distinction being surgical vs medical. This tailoring will make EWS more complex and less suited for paper-based calculation, but new electronic health records are increasingly able to incorporate decision support, and some EWS have been developed for electronic calculation only. Of particular interest in this regard is the score developed by Escobar et al,31 which groups patients into categories according to the reason for admission and calculates a different subscore based on that category. While the score by Escobar et al does not split patients based on medical or surgical status, a more general interpretation of our results suggests that a score may be more accurate if it classifies patients into subgroups with different subscores. This is consistent with the finding that the score by Escobar et al performs better than MEWS.28 Unfortunately, the paper describing it does not provide enough detail to use it in our database.
A recent systematic review showed increasing evidence that RRS may be effective in reducing cardiorespiratory arrests (CRAs) occurring in a non-ICU setting and, more important, overall in-hospital mortality.32 While differing implementation strategies (eg, different length of the educational effort, changes in the frequency of vital signs monitoring) can impact the success of such an initiative, it has been speculated that the afferent limb (which often includes an EWS) might be the most critical part of the system.33 Our results show that the most widely used EWS perform significantly worse on surgical patients, and suggest that a way to improve the accuracy of EWS would be to tailor the risk calculation to different patient subgroups (eg, medical and surgical patients). A plausible next step would be to demonstrate that tailoring risk calculation to medical and surgical patients separately can improve the risk predictions and accuracy of EWS.