Identifying Subgroups of Complex Patients With Cluster Analysis

Cluster analysis can aid in identifying subgroups of patients with similar patterns of comorbid conditions for targeted care management.
Published Online: August 09, 2011
Sophia R. Newcomer, MPH; John F. Steiner, MD, MPH; and Elizabeth A. Bayliss, MD, MSPH

Objective: To illustrate the use of cluster analysis for identifying sub-populations of complex patients who may benefit from targeted care management strategies.


Study Design: Retrospective cohort analysis.


Methods: We identified a cohort of adult members of an integrated health maintenance organization who had 2 or more of 17 common chronic medical conditions and were categorized in the top 20% of total cost of care for 2 consecutive years (n = 15,480). We used agglomerative hierarchical clustering methods to identify clinically relevant subgroups based on groupings of coexisting conditions. Ward’s minimum variance algorithm provided the most parsimonious solution.


Results: Ward’s algorithm identified 10 clinically relevant clusters grouped around single or multiple “anchoring conditions.” The clusters revealed distinct groups of patients including: coexisting chronic pain and mental illness, obesity and mental illness, frail elderly, cancer, specific surgical procedures, cardiac disease, chronic lung disease, gastrointestinal bleeding, diabetes, and renal disease. These conditions co-occurred with multiple other chronic conditions. Mental health diagnoses were prevalent (range 28% to 100%) in all clusters.


Conclusions: Data mining procedures such as cluster analysis can be used to identify discrete groups of patients with specific combinations of comorbid conditions. These clusters suggest the need for a range of care management strategies. Although several of our clusters lend themselves to existing care and disease management protocols, care management for other subgroups is less well-defined. Cluster analysis methods can be leveraged to develop targeted care management interventions designed to improve health outcomes.

(Am J Manag Care. 2011;17(8):e324-e332)

This study illustrates the use of cluster analysis to identify sub-populations of complex patients for potential targeted care management within an integrated health maintenance organization.


  • Among a cohort of adults with multimorbidity and high healthcare utilization, we identified 10 clinically relevant clusters of complex patients.


  • While care management protocols may already exist in many healthcare settings for some common clusters, other clusters identified present opportunities for new or enhanced care management.


  • Data mining methods such as cluster analysis can be applied in other settings where electronic diagnosis data are readily available.
By 2020, over 81 million persons in the United States will have 2 or more chronic conditions.1 Multimorbidity results in adverse health outcomes and higher healthcare costs, and challenges current models of care delivery.2,3 Care management has the potential to improve health outcomes for persons with multimorbidities. However, most disease and care management strategies have been developed to improve specific health outcomes for populations defined by single diseases or specific circumstances (such as hospital discharge).4-11 There is a need for strategies that can identify sub-populations with multiple, interacting diseases, in order to provide them with appropriate and relevant care management support.

Investigations to identify these populations of complex patients have traditionally relied upon multivariable regression analyses to identify patient-level characteristics (such as demographics and diseases) that predict the outcome of interest (such as hospitalization).12-14 As compared with investigations that use multivariable regression analyses to identify individual disease predictors of specific outcomes, data mining techniques provide an opportunity to empirically identify groups of patients with similar patterns of multimorbidities. One such technique, cluster analysis, refers to classification methods that are used for discovering groups or “clusters” of “highly similar entities” within data sets.15 Cluster analyses are common in psychology, sociology, and marketing research, and the methods have been used to a limited extent in health services research.16-18 While cluster analyses previously have been used to discover patterns of multimorbidities,19-21 in this study we demonstrate the application of such methods for identifying clusters of patients with high utilization that may suggest opportunities for enhanced care management in a managed care setting.

We used cluster analysis to explore a large, 2-year cohort of health maintenance organization members with 2 or more chronic conditions. We hypothesized that within a large, complex patient population, cluster analysis would reveal groups of patients with distinct patterns of comorbid conditions. Although some of these subgroups would be characterized by well-known patterns of co-occurring medical conditions with established care management strategies, other subgroups would reveal combinations of comorbidities that might benefit from new, proactive, and targeted care management.



Kaiser Permanente Colorado (KPCO) is an integrated, not-for-profit health maintenance organization. During the years studied (2006 and 2007), KPCO had approximately 430,000 members. This study was approved by KPCO’s Institutional Review Board.

Study Population

The study population consisted of KPCO members 21 years or older on January 1, 2006, categorized in the top 20% of total cost of care in both 2006 and 2007, each with 2 or more of 17 common chronic medical conditions. Annual cost estimates combined general ledger costs with direct and indirect utilization-related costs to provide cost-of-care estimates for KPCO members.22 We excluded members with a long-term care facility stay, chronic kidney dialysis, or an inpatient visit of greater than 30 days during the 2 years based on the premise that their unique and significant care management needs are likely to already be well defined. Six extremely high cost outliers were also removed from the cohort.
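The inclusion criteria above amount to a two-year filter on cost rank and condition count. The following is a minimal sketch, not the authors' SAS code: the function names, the dictionary layout, and the rank-based reading of "top 20%" (computed over all members in each year) are assumptions, and the exclusions for long-term care, dialysis, long inpatient stays, and cost outliers are not modeled.

```python
def top_fraction_ids(cost_by_member, fraction=0.20):
    """Return IDs of members whose annual cost ranks in the top `fraction`.

    Assumption: ties are broken by sort order; the paper does not say how.
    """
    n_top = int(len(cost_by_member) * fraction)
    ranked = sorted(cost_by_member, key=cost_by_member.get, reverse=True)
    return set(ranked[:n_top])


def select_cohort(cost_2006, cost_2007, condition_counts, min_conditions=2):
    """Members in the top cost fraction in BOTH years with enough conditions."""
    high_both_years = top_fraction_ids(cost_2006) & top_fraction_ids(cost_2007)
    return {m for m in high_both_years
            if condition_counts.get(m, 0) >= min_conditions}


# Hypothetical data for 10 members (IDs and costs are made up).
cost_2006 = {f"m{i}": 100 - i for i in range(10)}      # m0 and m1 are highest
cost_2007 = {f"m{i}": 50 - i for i in range(10)}
cost_2007.update({"m0": 80, "m1": 90})                  # m0 and m1 highest again
condition_counts = {"m0": 3, "m1": 1, "m2": 4}          # of the 17 study conditions

cohort = select_cohort(cost_2006, cost_2007, condition_counts)
print(cohort)  # m1 is dropped: high cost in both years but only 1 condition
```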

We compiled a list of 17 chronic medical conditions based on prevalence in the general population, prevalence in our specific cohort, and a literature search of conditions likely to predict hospitalization or adverse health outcomes in complex patients.13,23-34 The selected conditions were diabetes, chronic obstructive pulmonary disease (COPD), chronic kidney disease, stroke, obesity, dementia, fall, hip fracture, chronic pain, skin ulcer, orthopedic surgery, back surgery, abdominal surgery, gastrointestinal bleeding, cancer (excluding non-melanoma skin cancer), cardiac disease (which included coronary artery disease and congestive heart failure), and mental health conditions—primarily depression, but also including generalized anxiety and bipolar disorders. Determinations of whether cohort members had a given condition were based on inpatient and outpatient International Classification of Diseases, Ninth Revision (ICD-9) diagnosis and procedure codes in 2006 and 2007. In addition, we used KPCO’s cancer registry to determine cancer diagnoses in 2005, 2006, and 2007. We considered a cohort member to have obesity if they had an ICD-9 diagnosis code for obesity or a median body mass index (BMI) greater than or equal to 30 in 2006 and 2007. BMI data were available for 98.6% of cohort members; if a cohort member did not have a BMI value or an obesity diagnosis in 2006 or 2007 then we did not consider them to be obese.
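The obesity rule described above can be written as a short decision function. This is a sketch under stated assumptions: the function and argument names are hypothetical, and the median is taken over all BMI measurements pooled across 2006 and 2007, which is one possible reading of the text.

```python
from statistics import median


def is_obese(has_obesity_diagnosis, bmi_values):
    """Apply the obesity rule: an ICD-9 obesity diagnosis OR median BMI >= 30.

    `bmi_values` holds every BMI recorded in 2006 and 2007 (assumed pooled).
    A member with neither a diagnosis nor any BMI value is not counted as obese.
    """
    if has_obesity_diagnosis:
        return True
    if not bmi_values:
        return False
    return median(bmi_values) >= 30


print(is_obese(False, [29.0, 31.0, 32.5]))  # True: median BMI is 31.0
print(is_obese(False, []))                  # False: no diagnosis, no BMI data
print(is_obese(True, []))                   # True: diagnosis code alone suffices
```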

Statistical Analysis

SAS version 9.2 (SAS Institute, Cary, North Carolina) was used for all analyses. We described demographic attributes, healthcare utilization, comorbidity score (using the Quan adaptation of the Elixhauser comorbidity index),35 and prevalence of clinical conditions within the cohort using frequencies and medians with 25th and 75th percentiles.

Agglomerative Hierarchical Clustering. We used agglomerative hierarchical clustering to identify clinically relevant groups of cohort members with similar multimorbid conditions. With this method of cluster analysis, each cohort member starts as its own cluster. The 2 most similar clusters are merged and this new cluster replaces the 2 former clusters. The process continues until there is only 1 cluster containing all observations.15,36,37 After the clustering algorithm is run, the user must select the appropriate cutoff point for the number of clusters desired based on clinical importance or other pre-specified criteria.
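The merge loop described above can be sketched in a few lines. This is an illustrative, deliberately naive implementation, not the SAS procedure used in the study: the function names are hypothetical, the merge criterion shown (average Hamming distance between members) is only a placeholder, and stopping once `n_clusters` remain is equivalent to running to a single cluster and then cutting the tree at that level.

```python
def average_hamming(cluster_a, cluster_b):
    """Placeholder merge criterion: mean number of differing 0/1 indicators."""
    total = sum(sum(x != y for x, y in zip(a, b))
                for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, merge_cost, n_clusters):
    """Start with one cluster per member; merge the closest pair repeatedly."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost([points[k] for k in clusters[i]],
                                  [points[k] for k in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # The merged cluster replaces the 2 clusters it was built from.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical members; each column is the presence (1) or absence (0) of a condition.
points = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
result = agglomerate(points, average_hamming, n_clusters=2)
print(sorted(sorted(c) for c in result))  # [[0, 1], [2, 3]]
```

The exhaustive pair search is far too slow for a cohort of 15,480 members; it is shown only to make the steps explicit.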

Algorithms. Various algorithms are available for cluster analysis. For this study, we used Ward’s minimum variance method as the primary algorithm. With this algorithm, every possible cluster combination is considered at each step of agglomerative hierarchical clustering, and the combination that results in the smallest addition to the error sum of squares is selected.15,37 Ward’s method is a widely used algorithm that minimizes the variance within clusters and is also known to produce clusters of similar sizes.15,17,18,38-43 We compared results from Ward’s method with results from the flexible beta algorithm, in which the user sets the value of beta, and beta values less than zero optimize the dissimilarity between clusters.19,20,44
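The two merge criteria can be illustrated as follows. This is a sketch, not the study's implementation: the function names are hypothetical, Ward's cost is computed directly from Euclidean sums of squares on 0/1 indicators, and the flexible beta update uses the standard Lance-Williams form with alpha = (1 - beta) / 2, which the text above does not spell out.

```python
def error_sum_of_squares(cluster):
    """Sum of squared deviations of each member from the cluster centroid."""
    n = len(cluster)
    centroid = [sum(col) / n for col in zip(*cluster)]
    return sum((x - m) ** 2 for row in cluster for x, m in zip(row, centroid))


def ward_cost(cluster_a, cluster_b):
    """Addition to the total error sum of squares if the 2 clusters merge."""
    return (error_sum_of_squares(cluster_a + cluster_b)
            - error_sum_of_squares(cluster_a)
            - error_sum_of_squares(cluster_b))


def flexible_beta_update(d_ki, d_kj, d_ij, beta=-0.25):
    """Distance from cluster k to the merger of clusters i and j."""
    alpha = (1 - beta) / 2
    return alpha * (d_ki + d_kj) + beta * d_ij


# Hypothetical clusters of members described by 3 condition indicators.
a = [[1, 1, 0]]
b = [[1, 1, 0], [1, 0, 0]]
c = [[0, 0, 1]]

# Ward's method would merge a with b rather than a with c,
# because that merge adds less to the error sum of squares.
print(ward_cost(a, b) < ward_cost(a, c))  # True
print(flexible_beta_update(0.5, 1.0, 0.4, beta=-0.25))
```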

Analytic Process. In the analytic data set, the presence or absence of each of the 17 conditions was represented with a 1 or 0 for each cohort member. We first randomly split the full analytic data set into 2 equally sized data sets. We then converted each split data set into a dissimilarity matrix using Jaccard’s coefficient. This is an appropriate distance measure for clinical conditions, as it considers the number of conditions that 2 people have in common and ignores conditions that neither person has.19
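A small worked example may help show why Jaccard's coefficient ignores conditions that neither person has. The sketch below is illustrative only: the function names and the 4 example conditions are hypothetical, and treating 2 members with no conditions at all as identical is an assumption (the case cannot arise in this cohort, where every member has at least 2 conditions).

```python
def jaccard_dissimilarity(a, b):
    """1 - (conditions both have) / (conditions at least one has)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        return 0.0  # assumption: no conditions on either side means identical
    return 1.0 - both / either


def dissimilarity_matrix(rows):
    """Symmetric matrix of pairwise Jaccard dissimilarities."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = jaccard_dissimilarity(rows[i], rows[j])
    return matrix


# Columns (hypothetical subset): diabetes, COPD, chronic pain, mental health.
members = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
D = dissimilarity_matrix(members)

# Members 0 and 1 share 2 of the 3 conditions that either has: 1 - 2/3.
print(round(D[0][1], 3))  # 0.333
# Members 0 and 2 share nothing, so they are maximally dissimilar.
print(D[0][2])            # 1.0
```

Note that the shared absence of COPD in members 0 and 1 does not make them more similar; only conditions present in at least one of the pair enter the calculation.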

We ran our primary algorithm, Ward’s minimum variance method, on both split data sets. The pseudo F, pseudo T, and r² statistics were examined for different numbers of clusters to identify possible clustering solutions.37 These statistics pointed to several candidate numbers of clusters, and membership in these clusters was described by examining the prevalence of each condition in the cluster. Cluster membership was compared between the 2 split data sets to assess the consistency of the clustering process. Because cluster membership was similar between the 2 split data sets, which supported the stability of the algorithm in this population, Ward’s algorithm was then run on the entire data set. We subjectively determined that a 10-cluster solution produced the most clinically relevant clusters. For comparison, we then produced 10-cluster solutions using the flexible beta method, with beta set at -0.25 and -0.5, and compared these results with those from Ward’s method. This comparison indicated that the 10-cluster solution from Ward’s algorithm was the most parsimonious and provided the most clinically relevant groups. We then described the 10 clusters by the number of cohort members in the cluster, median age of cluster members, and percentage of cluster members with the most prevalent conditions in that cluster. We also described relative cost of care ratios for each cluster.
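Two of the steps above, describing clusters by condition prevalence and scoring a candidate solution with the pseudo F statistic, can be sketched as follows. This is an illustration under assumptions, not the SAS output used in the study: the function names and data are hypothetical, and the pseudo F shown is the usual ratio of between-cluster to within-cluster mean squares computed on the raw 0/1 indicators.

```python
def condition_prevalence(rows, labels):
    """Fraction of members in each cluster who have each condition."""
    prevalence = {}
    for label in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == label]
        prevalence[label] = [sum(col) / len(members) for col in zip(*members)]
    return prevalence


def pseudo_f(rows, labels):
    """(between SS / (k - 1)) / (within SS / (n - k)) for k clusters, n members."""
    n, k = len(rows), len(set(labels))
    grand_mean = [sum(col) / n for col in zip(*rows)]
    within = between = 0.0
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        within += sum((x - m) ** 2 for r in members for x, m in zip(r, centroid))
        between += len(members) * sum((m - g) ** 2
                                      for m, g in zip(centroid, grand_mean))
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical members (3 condition indicators) and a 2-cluster assignment.
rows = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
labels = [0, 0, 1, 1]

prev = condition_prevalence(rows, labels)
print(prev[0])                 # [1.0, 0.5, 0.0]
print(pseudo_f(rows, labels))  # 4.0
```

Larger pseudo F values indicate tighter, better separated clusters, so the statistic can be compared across candidate numbers of clusters before clinical judgment is applied.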


Table 1 provides descriptive demographic and disease characteristics of the study cohort (n = 15,480). The median age of cohort members was 65 years, and 59.1% of cohort members were women. Cohort members had a median of 5 chronic medical conditions (including, but not limited to, the 17 conditions included in the cluster analysis).
