The American Journal of Managed Care March 2009
Voice Response System to Measure Healthcare Costs: A STAR*D Report
Moderate underreporting biases were found when patient responses to an interactive voice response system were compared with medical records in the STAR*D clinical trial.
Objective: To evaluate a telephone-operated, interactive voice response (IVR) system designed to collect use-of-care data from patients with major depression (UAC-IVR).
Study Design: Patient self-reports from repeated IVR surveys were compared with provider records for 3789 patients with major depression at 41 clinical sites participating in the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial.
Methods: UAC-IVR responses were examined for consistency and compared with provider records to compute reporting biases and intraclass correlation coefficients. Predictors of inconsistent responses and reporting biases were based on mixed logistic and regression models adjusted for need and predisposing and enabling covariates, and corrected for nesting and repeated measures.
Results: Inconsistent responses were found for 10% of calls and 21% of patients. Underreporting biases (−20%) and moderate agreement (intraclass correlation of 68%) were found when UAC-IVR responses were compared with medical records. IVR reporting biases were less for patients after 3 calls or more (experience), for patients with severe baseline symptoms (motivation), and for patients who gave consistent IVR responses (reliability). Bias was unrelated to treatment outcomes or demographic factors.
Conclusion: Clinical managers should use IVR systems to collect service histories only after patients are properly trained and responses monitored for consistency and reporting biases.
(Am J Manag Care. 2009;15(3):153-162)
The feasibility of interactive voice response (IVR) systems to collect use-of-care data was assessed in a large clinical trial (STAR*D) involving 41 clinics and 4041 patients with major depression.
- Moderate intraclass correlation and underreporting biases were found when patient responses were compared with medical records.
- Reporting biases varied with baseline symptoms and IVR experience, but not treatment outcomes, demographic characteristics, or care attitudes.
- Clinical managers should use IVR systems to collect service histories only after patients are properly trained and responses monitored for consistency.
In this study, we evaluate the performance of an interactive voice response (IVR) system that collected healthcare utilization and cost information from a computerized script administered by telephone (UAC-IVR) for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study.4-8 STAR*D followed approximately 4000 patients who were being treated for nonpsychotic major depressive disorder (MDD) by 400 clinicians at 41 sites in both specialty and primary care settings, and in both the public and private sectors. To capture use-of-care information, study subjects were asked to dial a centralized number, listen to instructions, and answer computer-scripted questions by pressing keys on a touch-tone telephone pad.
As a data collection tool, IVR systems have gained both public9-12 and clinical13,14 acceptance. Compared with personal interviews, IVR responses are associated with lower collection costs, greater patient convenience, and fewer transcription errors.15 These systems also allow for remote data access, automated scoring,16 patient feedback,17 and opportunities for self-disclosure of sensitive information.16,18-21 Furthermore, IVR technology has been applied to studies on alcohol use,22,23 cognitive functioning,24 work and social adjustment,25 chronic insomnia,26 smoking cessation,27 depressive symptoms,28 and obsessive-compulsive disorder.29 Good reliability has been reported when IVR results are compared with responses from written questionnaires and personal interviews.30 A high correspondence has been found between psychiatric diagnoses based on the Primary Care Evaluation of Mental Disorders (PRIME-MD) screening instrument using IVR technology and those obtained using the Structured Clinical Interview for DSM-IV (SCID-IV) interview.31 The specificity and sensitivity of an IVR mental health screener for identifying anxiety and depressive disorders, obsessive-compulsive disorders, eating disorders, and alcohol use disorders also have been demonstrated.32
Prior studies have not focused on IVRs as a data collection tool to measure patient total use of care. In this study, we evaluated STAR*D’s new use-of-care survey, UAC-IVR, for both consistency and reliability. Consistency was determined by comparing responses to questions that asked if any care was used (yes/no) with questions that asked how much care was used (greater than zero/none). Reliability was assessed by comparing survey responses with provider records.
The STAR*D consent protocol1 and study design4-8 are described elsewhere. Briefly, subjects signed an institutional review board–approved informed consent form and were followed through prospective and sequenced treatments for MDD. Patients who responded to treatment or achieved remission were followed for an additional year. Data were collected from both providers and patients. Patient information was solicited from written questionnaires, face-to-face and telephone interviews, and patient-initiated IVR calls administered by Healthcare Technology Systems, Inc. in Madison, Wisconsin. STAR*D research staff helped patients make calls at baseline, after 6 weeks at each treatment level, at the end of each treatment level, at monthly intervals during the 12-month follow-up, and at study exit. To make a call, patients first dialed a toll-free number using a touch-tone telephone. The caller received recorded instructions, followed by a set of questions. After each question, the recorded message prompted patients to respond by pressing an appropriate number on the telephone keypad. The computer then recorded each response and automatically determined the next set of scripted questions to ask the patient.
Scripted questions covering patient use of healthcare during 90-day intervals are presented in the Figure. Questions were derived from the Utilization and Cost Methodology (UAC).3,33-36 Each time subjects accessed the IVR server, the computer checked to see whether the use-of-care script had run within the past 90 days. This strategy minimized risks of double-counting services from overlapping observation periods between IVR calls. Periods not covered by an IVR call were treated as missing.
Respondents were first asked whether they had used care during the past 3 months (yes or no). Patients who responded “yes” were subsequently asked how much care they had used. Patients were asked about using services classified by setting (outpatient clinic visits, emergency room visits, and inpatient days stayed) and by type (depression related, other-psychiatric, and general medical problems). For evaluative purposes, a response was considered “inconsistent” if the respondent answered “yes” to using care while subsequently reporting that “zero” days or visits were actually used. Responses that were not inconsistent were considered consistent.
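The inconsistency rule applied in these analyses can be expressed as a single predicate. This is a hypothetical sketch of the rule stated in the text; the argument names are assumptions:

```python
def is_inconsistent(reported_any_use: bool, amount_used: int) -> bool:
    """A response is inconsistent when the patient answers "yes" to
    having used care but then reports that zero days or visits were
    actually used. Sketch of the rule described in the article."""
    return reported_any_use and amount_used == 0
```

Note the asymmetry: answering "yes" with a positive count, or "no" (which skips the quantity question), both count as consistent.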
To assess reliability, provider data were obtained from billing claims and medical charts for patients signing a medical release. STAR*D focused on depression-related care; thus, records for services not related to depression were generally unavailable, and these analyses were limited to depression-related outpatient visits only. Services were classified by Current Procedural Terminology (CPT),37 the level 1 Healthcare Common Procedure Coding System,38 and psychiatric diagnoses based on the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV).39 Abstractors counted the number of outpatient visits that were depression related (DSM-IV 296, 311) but not emergency room related (CPT 99281, 99282, 99283, 99284, 99285, 99288) for each 90-day period ending on the date of the respective IVR survey.
Patient demographic, education, earnings, employment, and health insurance information was taken from face-to-face and telephone interviews. Patients rated (agreed or neutral/disagreed): “If I can get the help I need from a doctor, I believe that I will be much better able: (a) to make important decisions that affect my life and those of my family? and (b) to enjoy things that interest me?” Patients also rated (helpful vs neutral/not helpful): “the current overall impact of your family and friends on your condition?” Also collected using the IVR system were family size, health insurance status, Medicaid eligibility, and mental and physical functioning based on the Medical Outcomes Study 12-item short form.40
For purposes of these analyses, treatment outcomes were based on the 17-item Hamilton Rating Scale for Depression (HRSD17)41,42 administered by telephone at the end of the first treatment step (citalopram), and on the written, patient self-reported, 16-item Quick Inventory of Depressive Symptomatology (QIDS-SR16),43-46 administered at baseline and at each STAR*D clinic visit. Treatment outcomes were computed as (1) remission defined by an HRSD17 score of 7 or lower at exit from the first treatment step, (2) remission defined by a QIDS-SR16 score of 5 or lower, and (3) response defined by a reduction in QIDS-SR16 score from baseline of 50% or more. Patients with missing HRSD17 scores were not considered to have achieved remission.4
Interactive voice response use-of-care responses were evaluated for both consistency and reliability. To assess reliability, counts of depression-related, nonemergency outpatient visits based on IVR responses were compared with provider records that spanned comparable time periods to compute bias (mean difference) and intraclass correlation from 2-way mixed models.47,48
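On paired visit counts, the bias (mean IVR-minus-record difference) and a single-measures consistency intraclass correlation from a two-way model can be computed as follows. This is a simplified NumPy sketch of the standard ICC(3,1) formula; it does not reproduce the study's full mixed-model estimator, which also corrected for site nesting and repeated measures:

```python
import numpy as np

def bias_and_icc(ivr_counts, record_counts):
    """Reporting bias (mean IVR-minus-record difference) and the
    two-way, single-measures consistency ICC, i.e. ICC(3,1).
    Simplified sketch; not the study's exact mixed-model estimator."""
    data = np.column_stack([
        np.asarray(ivr_counts, dtype=float),
        np.asarray(record_counts, dtype=float),
    ])
    n, k = data.shape                        # n paired periods, k = 2 sources
    bias = (data[:, 0] - data[:, 1]).mean()  # negative => underreporting

    # Two-way ANOVA decomposition: rows = periods, columns = sources.
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((data - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return bias, icc
```

Because ICC(3,1) measures consistency rather than absolute agreement, a uniform underreporting offset shows up in the bias estimate but leaves the ICC at 1, which is why bias and intraclass correlation are reported as separate quantities.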
The associations of selected predictor variables with response inconsistency were computed from 3-level mixed logistic models. Similarly, the associations of selected predictor variables with reporting biases were computed from 3-level mixed regression models. Both sets of models were computed using Hierarchical Linear Modeling software,49 where level 1 is IVR calls, level 2 is individual patients, and level 3 is study sites. Both models corrected for facility nesting and repeated measures with random-effects terms. Estimates of the association for each selected predictor variable were adjusted, in turn, for the mean-centered values of a given set of covariates. Based on traditional theory,50 these covariates included need (baseline QIDS-SR16, age), predisposing (graduated high school, Hispanic, African American, sex), and enabling (married, enrolled in private health insurance plan, and employed) variables. An additional covariate was added: the order of the call. To account for complex error distributions, significance tests were based on robust estimates of standard errors.
STAR*D enrolled 4041 subjects, of whom 94% (n = 3789) completed an IVR use-of-care script, making 9864 calls, or 2.6 calls per patient (SD = 1.7, range = 1-8) (Table 1). Among the 3789 subjects who completed the script, 17% (655 of 3789) were African American (excluding Hispanic black); 12% (464 of 3784) were Hispanic; 63% (2377 of 3788) were female; 34% (1270 of 3785) were married; 57% (2157 of 3784) were employed; 88% (3315 of 3784) had a high school diploma, General Educational Development (GED) certification, or higher; 12% (458 of 3691) had Medicaid coverage; and 51% (1911 of 3715) had private health insurance. The mean age was 41.2 years (SD = 13.2 years).
Use-of-care information obtained from IVR calls is summarized in Table 2. All patients reported using depression-related care on at least 1 call. For all calls activating the use-of-care script, 50% (4921 of 9864) reported depression-related care, 36% (3545 of 9864) reported general medical care, 12% (1137 of 9864) reported other psychiatric care, 2% (166 of 9864) reported an inpatient stay for depression, and 3% (311 of 9864) reported an inpatient stay for general medical purposes.
Table 2 also lists by setting the number of sites, subjects, and IVR calls that contained inconsistent use-of-care responses. There were 1069 instances of an inconsistent response, among 944 of 9864 (10%) IVR calls, from 778 of 3789 (21%) participants, at 39 of 41 (95%) study sites. By comparison, 14% (537 of 3745) of patients gave inconsistent employment and earnings responses.