Screening for Depression and Suicidality in a VA Primary Care Setting: 2 Items Are Better Than 1 Item

The American Journal of Managed Care, November 2004 - Part 2, Volume 10, Issue 11 Pt 2

Objective: To evaluate the psychometric properties of a single-item depression screen against validated scoring algorithms for the Patient Health Questionnaire (PHQ) and the utility of these algorithms in screening for depression and suicidality in a Department of Veterans Affairs (VA) primary care setting.

Study Design: Recruitment phase of a randomized trial.

Methods: A total of 1211 Portland VA patients with upcoming primary care clinic appointments were administered, by telephone, a single item assessing depressed mood over the past year and the PHQ. The PHQ-9 (9 items) encompasses DSM-IV criteria for major depression, the PHQ-8 (8 items) excludes the thoughts of death or suicide item, and the PHQ-2 (2 items) assesses depressed mood and anhedonia. Patients whose responses suggested potential suicidality were administered 2 additional items assessing suicidal ideation. Patients receiving mental health specialty care were excluded.

Results: Using the PHQ-9 algorithm for major depression as the reference standard, the VA single-item screen was specific (88%) but less sensitive (78%). A PHQ-2 score of ≥3 demonstrated similar specificity (91%) with high sensitivity (97%). For case finding, the PHQ-8 was similar to the PHQ-9. Approximately 20% of patients screened positive for moderate depression, 7% reported thoughts of death or suicide, 2% reported thoughts of harming themselves, and 1% had specific plans.

Conclusions: The PHQ-2 offers brevity and better psychometric properties for depression screening than the single-item screen. The PHQ-9 item assessing thoughts of death or suicide does not improve depression case finding; however, one third of patients endorsing this item reported recent active suicidal ideation.

(Am J Manag Care. 2004;10(part 2):839-845)

Depression is common among patients in primary care settings, yet it is underrecognized and undertreated by primary care providers.1-3 Given the high prevalence, morbidity, and mortality associated with untreated depression, many medical institutions have initiated systematic guideline-based screening programs.4-6 Widely used screening instruments include the Beck Depression Inventory, the Center for Epidemiologic Studies Depression Screen (CES-D), and the Zung Self-Assessment Depression Scale.7 Compared with a standardized diagnostic instrument, these screens demonstrate very good sensitivity and fair to good specificity.8

Still, administering and evaluating the 20 or more items typically found in measures of depression can be relatively time-consuming, and therefore difficult to integrate into busy primary care practices.9,10 Thus, shorter instruments have been introduced and tested.8,11-14 Of note, the recently developed 9-item Patient Health Questionnaire (PHQ-9)15-17 is increasingly being administered and tested in clinical and research settings.18-23 The PHQ-9 has good sensitivity (88%) and specificity (88%) for major depression compared with a diagnostic interview conducted by a mental health professional using SCID (Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders, Revised 3rd Edition [DSM-III-R]) criteria.17 The PHQ-9 offers concurrent validity with measures of functional impairment, high internal consistency and test-retest reliability, simplicity, and face validity15-19; in addition, severity scores may be used to track change over time.7,16,23-24

Looking at the shortest possible measures, Whooley et al13 found that 2 items (measuring depressed mood and anhedonia over the past month) demonstrated excellent sensitivity (96%) but only fair specificity (57%) compared with the Diagnostic Interview Schedule. Kroenke et al25 tested the validity of the first 2 items (depressed mood and anhedonia over the past 2 weeks) of the PHQ (PHQ-2) in a population of community primary care and obstetrics-gynecology patients. They found that a PHQ-2 score of 3 or higher (PHQ-2 ≥3) had a sensitivity of 83% and specificity of 92% compared with a diagnostic interview by a mental health professional. Also using a diagnostic interview as the criterion, Williams et al26 reported that the sensitivity and specificity for a single question ("Have you felt depressed or sad much of the time in the past year?") approached those of the CES-D (85% vs 88% and 66% vs 75%, respectively). Although the data of Williams et al suggest that 1 item performs well, the characteristics of their sample (predominantly female and Hispanic) limit generalization to other settings.

In 1999, the Portland Veterans Affairs Medical Center (VAMC) primary care clinics introduced a similar single item ("Have you been depressed or sad most of the past year?") as a routine annual depression screen. In contrast to Williams et al's population, the VA patient population is predominantly male, Caucasian, and older.27 The primary objective of this study was to evaluate the sensitivity and specificity of the single-item screen with the PHQ-9 as the reference standard in a VA primary care clinic. We also sought to estimate the proportion of primary care patients not currently receiving mental health specialty care who would screen positive for depression and possible suicidality.



The study was conducted in the Portland VAMC primary care clinics, which include 2 hospital-based and 2 community-based clinics. In 2002, about 23 000 patients were followed in these clinics. Our local population is primarily older (mean age 62 years), Caucasian (87% of patients with recorded ethnicity) men (94%), reflecting national VA demographics. The modal panel size for physicians is 1100-1200 patients; for nurse practitioners and physician assistants, 760-960 patients.

Study Sample and Procedure

In July 2002 we initiated recruitment for a randomized, controlled trial of a low-intensity collaborative intervention for depression in primary care (DEP-PC). All patients screened for participation in DEP-PC between July 2002 and February 2003 were eligible for the current study. Potential participants in DEP-PC were identified by using computerized lists of patients who were due to see their primary care providers within a month and whose primary care providers (n = 41) were participating in DEP-PC. We excluded patients who had received treatment from a mental health care clinician within the prior 6-month period or who had Alzheimer's disease, cognitive problems, psychotic symptoms, or terminal illness documented in their medical records (the Figure).

Patients who met inclusion criteria for DEP-PC were sent a brief letter outlining the study. One to two weeks later, a research assistant telephoned, explained the purpose of the study, and asked permission to continue with a 5-minute telephone interview. Up to 3 call attempts were made to reach each patient. It has been established that depression data collected by telephone are comparable to depression data obtained by in-person interviews.28 Research assistants were trained in procedures for obtaining clinical assistance for severely depressed or potentially suicidal patients. Investigators contacted patients who expressed active suicidal ideation for assessment and to offer care. The local institutional review board approved the study.

All eligible patients who agreed to be screened for DEP-PC were administered the PHQ and the single-item screen currently used in the primary care clinics. Over the first 5 months of recruitment, 977 patients were screened. Of the 587 patients who answered "not at all" to the first 2 PHQ-9 items (anhedonia and depressed mood), more than half (54%) also answered "not at all" to each of the remaining 7 PHQ items. Moreover, only 3 of 587 (0.5%) patients had PHQ scores suggesting moderate depression (PHQ-9 ≥ 10). Therefore, to limit the length of screening calls, we began administering the full PHQ-9 only when patients endorsed at least 1 of the first 2 PHQ items. Those interviewed using this "abbreviated screen" who did not endorse either of the first 2 items (n = 167) received a score of zero and the interview ended.

Over the 7-month study period, 1447 veterans enrolled in the primary care clinics were contacted by phone for screening (the Figure). Of these, 1240 (85.7%) patients completed the screening, 171 (11.8%) declined to be screened, and 36 (2.5%) indicated that they had seen a mental health clinician in the past 6 months. Among the 1240 screened, 14 (1.1%) patients skipped 2 or more PHQ items and 15 (1.2%) patients did not answer the single-item screen, leaving a final sample size of 1211. Veterans screened for DEP-PC were slightly more likely to be Caucasian (93% of patients with recorded ethnicity) and older (mean age 66 years) than veterans in the general primary care population.


Patient Health Questionnaire-9.

The PHQ-9 depression scale is derived from the PRIME-MD,15-17 a measure of mood, anxiety, alcohol, somatoform, and eating disorders with demonstrated diagnostic and concurrent validity. Patients use an ordinal scale (0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day) to rate the frequency of symptoms of depression over the past 2 weeks. The 9 items are based on the 9 DSM-IV criteria for the diagnosis of depression,29 and total scores range from 0 to 27. Options for administering and scoring the PHQ include using all 9 items, using the first 8 items (PHQ-8; excludes thoughts of death or suicide item), and using only the first 2 items (PHQ-2; anhedonia and depressed mood items). For classification, either the cut-point system (score of 5-9 = mild, 10-14 = moderate, 15-19 = moderately severe, and 20-27 = severe depression) or the algorithm developed and validated by Spitzer and his colleagues17 to be congruent with the DSM-IV criteria ("major depression algorithm") can be used. The PRIME-MD also contains an item to assess global functional impairment that can be administered in conjunction with the PHQ as a 10th item. Our screening protocol used the 9-item version, previously validated against clinician interview and measures of functional impairment.15,17,19
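The scoring options and cut points just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's scoring code: the function names are hypothetical, the item order (anhedonia and depressed mood first, the death or suicide item last) follows the description above, and the major depression algorithm is not implemented here.

```python
from typing import Dict, Sequence

# Severity bands from the cut-point system described in the text
# (5-9 mild, 10-14 moderate, 15-19 moderately severe, 20-27 severe).
SEVERITY_BANDS = [
    (20, "severe"),
    (15, "moderately severe"),
    (10, "moderate"),
    (5, "mild"),
]


def phq_scores(items: Sequence[int]) -> Dict[str, int]:
    """Return PHQ-9, PHQ-8, and PHQ-2 totals for nine ratings (each 0-3)."""
    if len(items) != 9 or any(r not in (0, 1, 2, 3) for r in items):
        raise ValueError("expected nine ratings, each 0, 1, 2, or 3")
    return {
        "phq9": sum(items),      # all 9 items, range 0-27
        "phq8": sum(items[:8]),  # drops the death or suicide item
        "phq2": sum(items[:2]),  # anhedonia and depressed mood only
    }


def severity(phq9_total: int) -> str:
    """Map a PHQ-9 total to its cut-point severity label."""
    for floor, label in SEVERITY_BANDS:
        if phq9_total >= floor:
            return label
    return "none"
```

For example, ratings of 2, 2, and then seven 1s give a PHQ-9 total of 11, which falls in the moderate band, with a PHQ-2 total of 4.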

The last item of the PHQ-9 evaluates the frequency of "thoughts that you would be better off dead or of hurting yourself in some way." We developed 2 additional follow-up questions for patients endorsing this item. The first was designed to clarify whether the patient is experiencing active suicidal ideation ("Are these thoughts that you would be better off dead, or thoughts of hurting or killing yourself?"). The second asks about active planning ("Over the past 2 weeks have you thought about specific ways you might hurt or kill yourself?").

Single-Item Screen.

In 1997, the Veterans Health Administration (VHA) released clinical practice guidelines for major depressive disorder, which included annual screening for all general medicine patients.4 At the Portland VAMC, primary care patients are screened annually for depression unless they are currently undergoing specialty mental health treatment. The screening item "Have you been depressed or sad most of the past year?" uses a yes/no response format and is based on the single item tested by Williams and his colleagues.26

Statistical Analysis

When a patient skipped a single PHQ item (17/1211, or <1.5%), the omitted value was imputed using mean substitution.30 Imputed data were not used in the analysis of detection of suicidal ideation. Internal consistency (Cronbach's alpha) was calculated by using data from patients interviewed during the first 5 months of recruitment who answered all 9 items (n = 962). There were no differences between the 962 (79%) patients assessed with the full PHQ and the 249 patients assessed with the abbreviated screen in terms of demographics or depression severity (ie, the proportion in each cohort classified as not depressed, mildly depressed, moderately depressed, etc).
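Mean substitution for a single skipped item might look like the sketch below. The text does not say which mean was substituted, so this example assumes the mean of that patient's own answered items (an item-level sample mean would be an equally plausible reading). The function name is hypothetical, and records with 2 or more skipped items are rejected because such patients were excluded from the sample.

```python
from typing import List, Optional, Sequence


def impute_single_missing(items: Sequence[Optional[float]]) -> List[float]:
    """Fill one skipped PHQ item (None) with the mean of the answered items.

    ASSUMPTION: person-mean substitution; the source only says
    "mean substitution" without specifying the mean used.
    """
    missing = [i for i, value in enumerate(items) if value is None]
    if len(missing) > 1:
        # Patients who skipped 2 or more items were excluded in the study.
        raise ValueError("two or more items skipped; record is not imputed")
    if not missing:
        return list(items)
    answered = [value for value in items if value is not None]
    fill = sum(answered) / len(answered)
    return [fill if value is None else value for value in items]
```

Because the imputed value need not be a whole number, totals computed from imputed records can be fractional.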


Receiver operating characteristic (ROC) curve analyses comparing patients screened before and after the change in PHQ administration procedure showed no significant differences for the single item, PHQ-2 ≥ 2, or PHQ-2 ≥ 3 when the major depression diagnosis algorithm, PHQ-9 ≥ 10, or PHQ-9 ≥ 15 was used as the reference standard. Thus, the data were combined for all subsequent analyses except PHQ inter-item correlations. Frequencies, correlations, and tests for differences were used for item-level analyses.


To evaluate the VA single-item measure and different scoring options of the PHQ, we used bivariate analyses (correlation and t tests) and ROC analysis. Through ROC analysis, the sensitivity and specificity of the study measure are assessed using a more established measure of disease status as the reference standard. The area under the curve (AUC) can range from 0 to 1.0; an AUC of .50 suggests that classification based on the instrument under study is no more accurate than random chance. Analyses were performed using SPSS version 11.5 for Windows (SPSS Inc, Chicago, Ill); a Web-based clinical calculator was used to calculate likelihood ratios.31
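For readers less familiar with these quantities, the sketch below computes sensitivity, specificity, and likelihood ratios from a 2 × 2 table of screen results against a reference standard. It is a generic illustration and not the Web-based calculator or the SPSS procedures used in the study; the function name is hypothetical and the counts in the usage note are invented.

```python
from typing import Dict


def diagnostic_accuracy(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Screening accuracy from counts of true/false positives and negatives."""
    sensitivity = tp / (tp + fn)  # reference-positive patients the screen flags
    specificity = tn / (tn + fp)  # reference-negative patients the screen clears
    # Likelihood ratios are undefined at the extremes; report infinity there.
    lr_positive = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_negative = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
    }
```

With invented counts of 90 true positives, 10 false negatives, 20 false positives, and 180 true negatives, both sensitivity and specificity are 90%, so a positive screen is about 9 times as likely in a patient who meets the reference standard as in one who does not.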




Internal consistency for the PHQ-9 was excellent (α = .86).32 No item detracted from the consistency of the scale; inter-item correlations ranged from .27 to .68. The strongest associations were between anhedonia and depressed mood (r = .68) and self-esteem and depressed mood (r = .62); the psychomotor and self-harm items had slightly weaker inter-item associations (ranging from .27 to .44) than the other items.
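Cronbach's alpha compares the sum of the item variances with the variance of the total score. A minimal sketch using the standard formula, α = k/(k − 1) × (1 − Σ item variances / total-score variance), is shown below; it is not the SPSS routine used in the study, the function name is hypothetical, and the data in the usage note are invented.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items table of ratings."""
    k = len(rows[0])  # number of items
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

When every item rises and falls together across respondents (for instance, three respondents answering 0, 1, and 2 on every item), alpha reaches its maximum of 1.0; weaker inter-item correlations pull it lower.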

The distribution of PHQ-9 scores (n = 1211) was positively skewed (mean = 4.76, median = 2, SD = 6.16). Using the cut scores for depression severity, 436 (36.0%) patients had scores indicating at least mild depressive symptoms (PHQ-9 ≥ 5); 251 (20.7%), at least moderate depressive symptoms (PHQ-9 ≥ 10); 120 (9.9%), at least moderately severe depressive symptoms (PHQ-9 ≥ 15); and 39 (3.2%), severe depressive symptoms (PHQ-9 ≥ 20). Using the major depression algorithm based on DSM-IV criteria, 12% of patients met the criteria for a provisional diagnosis of depression.



Of the 1211 study patients, 973 (80.3%) responded "no" to the VA single-item depression screen and 238 (19.7%) responded "yes" (Table 1). Table 1 presents single-item depression screen results by depression severity based on the PHQ-9. Patients with positive single-item screens had significantly higher PHQ-9 scores than those with negative screens (12.90 vs 2.77, t = 23.36, P < .001). Nearly 9 out of 10 patients (89.5%) with a positive single-item screen had PHQ scores indicating at least mild symptoms of depression. On the other hand, 8.2% of patients with PHQ-9 scores suggesting moderate to severe depression did not endorse the single-item screen.

Table 2 presents sensitivities, specificities, likelihood ratios, and areas under the ROC curve (AUCs) for the VA single-item screen and the PHQ-2, with the 3 main scoring algorithms of the PHQ-9 as reference standards. For each standard, PHQ-2 ≥ 2 demonstrated greater sensitivity than the VA single-item screen. PHQ-2 ≥ 3 is more sensitive than the VA single item when using the major depression algorithm as the reference standard, but the confidence intervals slightly overlap at PHQ-9 ≥ 10 and ≥ 15. In turn, PHQ-2 ≥ 3 is as sensitive as PHQ-2 ≥ 2 (ie, the confidence intervals overlap), except when screening for moderate depression symptoms (PHQ-9 ≥ 10). In terms of specificity, the VA single item outperformed PHQ-2 ≥ 2 but not PHQ-2 ≥ 3, for which the differences are not statistically significant. Finally, the AUC for PHQ-2 ≥ 2 was greater than the AUC for the VA single-item screen when using PHQ-9 ≥ 10 and the major depression algorithm; the AUC for PHQ-2 ≥ 3 was greater than the AUC for the VA single-item screen when using the more stringent standards (PHQ-9 ≥ 15 and the major depression algorithm).



Eighty (6.6%) patients rated the item "thoughts that you would be better off dead or of hurting yourself" as occurring at least several days over the past 2 weeks. In response to the follow-up questions, 28 (2.3%) acknowledged thoughts of harming themselves and 16 (1.3%) acknowledged having a specific plan. Table 3 displays the percentage of patients with potential suicidal ideation who would have been identified by the VA single-item screen and by PHQ scores using different PHQ formats and scoring methods. Of note, the correlation between PHQ-9 scores and PHQ-8 (which excludes the death/suicide item) scores was very high (r = .998, P < .001, n = 1044); only 3 patients with a PHQ-9 score of 10 or higher had a PHQ-8 score of less than 10.


The data presented here suggest that while the VA single-item depression screen is specific, it is only moderately sensitive when the PHQ-9 cut point for moderate depression is used as the reference standard. Although changing the definition of a "positive" PHQ-9 score from ≥10 to ≥15 brings the single item's sensitivity to within the range of those recorded for case-finding measures,12 raising the bar so that only those with moderately severe depression or major depression are detected may be inappropriate for screening conducted in a primary care setting.11,13 In comparison to the VA single-item screen, the PHQ-2 performed very well. Using a PHQ-2 cut point of ≥ 2 rather than ≥ 3 improves its sensitivity but also increases the false-positive rate; using the PHQ major depression algorithm as a reference standard, 58% of those in our study who scored positive at PHQ-2 ≥ 2 were false positives. Whichever cut point is selected, if a very short screen is to be used, our results suggest that the PHQ-2 surpasses the single-item screen in a VA primary care setting, particularly in terms of sensitivity.
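The share of positive screens that are false positives is the complement of the positive predictive value (PPV), which depends on prevalence as well as on sensitivity and specificity. As a rough illustration, the sketch below applies Bayes' rule to the rounded PHQ-2 ≥ 3 figures reported in this study (97% sensitivity, 91% specificity) and the 12% prevalence from the major depression algorithm. The function name is hypothetical, and the output is an approximation derived from rounded inputs, not a result reported by the study.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that a positive screen is a true positive (Bayes' rule)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Rounded values taken from the text; the result is illustrative only.
ppv = positive_predictive_value(sensitivity=0.97, specificity=0.91, prevalence=0.12)
false_positive_share = 1 - ppv
```

Under these inputs, roughly 6 in 10 positive screens at PHQ-2 ≥ 3 would meet the algorithm's criteria for major depression, which shows how even a highly specific screen yields many false positives when prevalence is low.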

Notably, our findings for a single-item depression measure differ from those of Williams et al26 (78% vs 85% sensitive and 88% vs 66% specific, respectively). The higher sensitivity Williams et al report may stem from their use of a clinician interview as the reference standard, differences in the sample patients' sex and ethnicity, or, less likely, the slight difference in wording. In contrast, our findings for the PHQ-2 are more consistent with those of Kroenke et al,25 who studied 580 patients (66% women, 21% ethnic minority, mean age of 46 years) from community-based primary care and obstetrics-gynecology clinics. Using PHQ-2 ≥ 3, Kroenke et al reported 83% sensitivity, 92% specificity, and an AUC of .93, again with a structured interview by a mental health clinician as the reference standard. We found 97% sensitivity, 91% specificity, and an AUC of .94 with the PHQ major depression algorithm as a reference standard. These results suggest that the 2-item screen is more generalizable across patient populations than the 1-item screen and/or that the impact of sample demographics on brief screen performance may be substantial. Indeed, Williams et al report predictive differences by ethnicity, and Kroenke et al found that patient age (but not patient sex) affected the results.

In our sample, scores based on the PHQ-8 signaled depression in all but 1 patient who expressed an active suicide plan. That is, administering the final PHQ-9 item assessing suicidal ideation did not improve case finding over the PHQ-8. Importantly, however, approximately one third of the patients who endorsed the PHQ-9 death or suicide item in our study had active suicidal ideation and received urgent clinical attention, which would not have occurred had they not been administered the item addressing thoughts of death or self-harm. Thus, for clinical purposes we recommend the following algorithm: if a patient responds affirmatively to either of the PHQ-2 items, the remaining 7 items of the PHQ-9 should be administered. If the PHQ-9 score suggests major depression or suicidal ideation, clinicians must be prepared to conduct further assessment and to offer or arrange for appropriate treatment.
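The recommended stepped procedure can be expressed as a short routine. The sketch below is illustrative and makes several assumptions that are not part of the recommendation itself: the `ask` callback, the function name, and the returned fields are hypothetical; a PHQ-9 total of 10 or higher stands in for "suggests major depression" (the major depression algorithm could be substituted); and any endorsement of the last item is treated as possible suicidal ideation.

```python
from typing import Callable, Dict, Union


def stepped_screen(ask: Callable[[int], int],
                   follow_up_cut_point: int = 10) -> Dict[str, Union[int, bool]]:
    """Administer the PHQ-2 first; continue to the full PHQ-9 only if endorsed.

    `ask(i)` returns the 0-3 rating for item index i (0-based; items 0 and 1
    are anhedonia and depressed mood, item 8 is the death or suicide item).
    ASSUMPTION: follow_up_cut_point=10 is an illustrative threshold.
    """
    first_two = [ask(0), ask(1)]
    if sum(first_two) == 0:
        # Neither PHQ-2 item endorsed: stop and score zero.
        return {"items_asked": 2, "phq9": 0, "follow_up": False}
    remaining = [ask(i) for i in range(2, 9)]
    total = sum(first_two) + sum(remaining)
    death_or_suicide_item = remaining[-1]
    needs_follow_up = total >= follow_up_cut_point or death_or_suicide_item >= 1
    return {"items_asked": 9, "phq9": total, "follow_up": needs_follow_up}
```

A patient who rates both opening items "not at all" is asked only 2 questions, whereas a patient who endorses either one completes all 9 and is flagged for clinician follow-up if the total reaches the cut point or the last item is endorsed at any level.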

Our data suggest that 1 of every 3 veterans seen in primary care who is not already receiving specialty treatment has symptoms consistent with at least mild depression, 1 in 5 has symptoms consistent with at least moderate depression, and 1 in 10 has symptoms consistent with moderately severe depression. A prevalence of approximately 20% is congruent with previous estimates from veteran samples.13,34 Yet our prevalence estimate is alarming in that, in contrast to previous VA studies,13,33 we excluded patients who had seen a mental health professional in the last 6 months.



It is important to note several limitations of this study. First, we used the PHQ and not a formal diagnostic interview as our reference standard. The PHQ, however, has strong, well-documented psychometric properties, and our PHQ-2 results are comparable to those of Kroenke et al,25 who used a mental health clinician interview as a reference standard. Second, the generalizability of our results to nonveteran populations may be limited. Individuals without telephones, who were severely depressed, or who otherwise were unable to complete a phone screening might be underrepresented. Third, the associations we detected may be inflated by various factors. For instance, all PHQ-9 items are keyed in the same direction, which can contribute to response sets and exaggerate internal consistency. Also, the PHQ-2 is drawn directly from the PHQ-9. However, scores based on the first 2 items of the PHQ correlate with scores based on the last 7 items almost as strongly as with PHQ-9 scores (r = .76 vs r = .88), suggesting that the association between the PHQ-2 and the PHQ-9 is not solely an artifact of common items. Finally, we administered the PHQ-9 and the single-item screen concurrently, although the influence of this is likely to be slight.34

If the goal of screening is to identify potential cases of depression, then the following must be considered. Current guidelines suggest that all general medicine patients not already being seen by mental health professionals should be screened, with repeat screening if risk factors or symptoms are present and systems are in place to support diagnosis and treatment.4,6 In depression screening instruments, sensitivity is critical.11,13 Thus, the single VA item is less than optimal. Administering 2 items improves performance with minimal added time investment, and using PHQ-2 ≥ 2 results in appropriate levels of sensitivity.

Although brief instruments can facilitate screening programs in primary care settings, these instruments are not sufficient to confirm the diagnosis of depression or the severity of suicidal ideation. Clinician assessment must follow. In making a diagnosis, the clinician should take into account the patient's history, comorbidities, functional status, and safety, all considerations precluded in any brief screening instrument.


We wish to thank Nancy Cuilwik, BS, LeAnn Snodgrass, Megan Crutchfield, BS, and Jeff Solodky, BA, for their help in data organization and analysis, and manuscript preparation.

From Research Service (KC), Behavior Health and Clinical Neurosciences Division (SKD), and the Division of Hospital and Specialty Medicine (MSG), Portland VA Medical Center, Portland, Ore; and the Department of Psychiatry (KC, SKD) and the Department of Medicine (MSG), Oregon Health & Science University, Portland.

This study was supported by the Department of Veterans Affairs, Veterans Health Administration, Health Services Research and Development Service project MHI 20-020-1. The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs.

Address correspondence to: Kathryn Corson, PhD, Portland VA Medical Center, PO Box 1034 (P3 DEP-PC), Portland, OR 97207. E-mail:


